test.ical.ly | getting the web by the balls



Using PHP Web Scraper Goutte in a Console Task in a Silex project

Ever since I discovered the free Facebook app hosting by Heroku, I have wanted to build something useful with it, so I came up with a small service app. Without going into detail about its nature yet, there was one immediate problem to solve: how to get hold of the data? I decided to scrape it off a website. I know this isn’t very nice, but unfortunately there is no feed I can use. So how best to scrape a website?

Use Goutte!

Goutte is yet another great tool provided by Fabien Potencier, and if you ever need to pull in data from a website that wouldn’t validate as XML, you should definitely check it out! If you want a bit of context, read Ryan Weaver’s article on php|architect.

Since I had already decided to use Silex for this app, I needed to find a) a way to create console tasks and b) a way to use Goutte in one of them.

The first problem was quickly solved by re-reading a blog post by Kevin Boyd, who documented exactly this.

Following his instructions, my project structure now looks like this.

This is only a very small change to my existing pet project.

What was left was to find a way to integrate Goutte smoothly into this setup so I could use it in a task.

First, I downloaded goutte.phar to src/goutte.phar.

Second, I created a custom Silex service provider for Goutte in src/Caefer/GoutteServiceProvider.php.

I didn’t use the autoload facility of the Goutte phar file because it would have thrown an error for declaring UniversalClassLoader a second time. That’s why I had to register the namespace for the Zend library manually in the provider.
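A provider along these lines might look roughly like the following. This is a sketch, not the original file. It assumes an early Silex release in which $app['autoloader'] is a UniversalClassLoader and $app->share() is available, and it assumes the Goutte classes themselves can already be autoloaded. The service name goutte and the parameter goutte.zend_path (the directory that contains the Zend/ folder) are hypothetical names chosen for illustration.

```php
<?php
// src/Caefer/GoutteServiceProvider.php (sketch under the assumptions above)

namespace Caefer;

use Goutte\Client;
use Silex\Application;
use Silex\ServiceProviderInterface;

class GoutteServiceProvider implements ServiceProviderInterface
{
    public function register(Application $app)
    {
        // Shared service: the client is created once, on first access.
        $app['goutte'] = $app->share(function () use ($app) {
            // Hypothetical parameter. Registering the Zend namespace here,
            // rather than loading the phar's own autoloader, avoids
            // declaring UniversalClassLoader a second time.
            if (isset($app['goutte.zend_path'])) {
                $app['autoloader']->registerNamespace('Zend', $app['goutte.zend_path']);
            }

            return new Client();
        });
    }

    public function boot(Application $app)
    {
        // Nothing to do here; later Silex 1.x releases require this method.
    }
}
```

The namespace is registered inside the closure so that it does not matter whether Silex applies the parameters passed to $app->register() before or after calling register() on the provider.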

Last, I modified bootstrap.php so that it knows my Caefer namespace and registers the service provider.
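The change to bootstrap.php could be as small as the fragment below. It is only a sketch: it assumes a Silex version that exposes the class loader as $app['autoloader'], and the $srcDir path is a guess that needs adjusting to the real project layout.

```php
<?php
// bootstrap.php (fragment); $app is the Silex\Application created earlier in this file.

// Assumed location of the src/ directory; adjust to the real layout.
$srcDir = __DIR__.'/src';

// Let the class loader find Caefer\GoutteServiceProvider in src/Caefer/.
$app['autoloader']->registerNamespace('Caefer', $srcDir);

// Register the provider. Any parameters the provider expects can be
// passed as an optional second array argument.
$app->register(new Caefer\GoutteServiceProvider());
```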

Now in my console.php I can use the Goutte client as follows.
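As an illustration, a task that prints the headlines of a page might be sketched like this. The task name, the URL and the service name goutte are placeholders, and the fragment assumes that $app (the bootstrapped Silex application) and $console (a Symfony Console application) already exist in console.php.

```php
<?php
// console.php (fragment); $app and $console are assumed to be set up earlier.

use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

$console
    ->register('scrape:headlines') // hypothetical task name
    ->setDescription('Prints the h2 headlines of a page')
    ->setCode(function (InputInterface $input, OutputInterface $output) use ($app) {
        // Placeholder URL; replace it with the page to scrape.
        $crawler = $app['goutte']->request('GET', 'http://example.com/');

        // extract(array('_text')) collects the text content of each matched node.
        $headlines = $crawler->filterXPath('//h2')->extract(array('_text'));

        foreach ($headlines as $headline) {
            $output->writeln(trim($headline));
        }
    });
```

filterXPath() is used here instead of filter() so that the sketch does not depend on the CssSelector component being available.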

Happy scraping!

· · · · ·

  • Kyle

    Thank you for this post. Is there any way you can post the code?

    Thanks again!

  • @Kyle, but everything is already there? Just follow the instructions (the ones in the linked blog post first) and you’re done. If I provided the sources here, they would soon be outdated, as Silex, Twig and Goutte are still being developed further.


