Using the PHP Web Scraper Goutte in a Console Task in a Silex Project
Ever since I discovered the free Facebook app hosting by Heroku, I have wanted to make something useful out of it, so I settled on a small service app. Without going into detail about its nature yet, there was one immediate problem to solve: how to get hold of the data. I decided to scrape it off a website. I know this isn’t very nice, but unfortunately there is no feed I can use. So how best to scrape a website?
Goutte is yet another great tool provided by Fabien Potencier, and if you ever need to pull in data from a website that wouldn’t validate as XML, you should definitely check it out! If you want a bit of context, read Ryan Weaver’s article on php|architect.
Since I had already decided to use Silex for this app, I needed to find a) a way to create console tasks and b) a way to use Goutte in one of them.
The first problem was solved quickly by re-reading a blog post by Kevin Boyd, who documented exactly this.
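For readers who haven’t seen that post, a console entry point along those lines might look roughly like the sketch below. It is not a copy of Kevin’s code: it assumes the Symfony Console component can be autoloaded, that bootstrap.php returns the Silex application, and the `hello` task is a hypothetical placeholder.

```php
<?php
// console.php: a minimal sketch of a console entry point for a Silex app.

use Symfony\Component\Console\Application;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

// Assumption: bootstrap.php sets up Silex and ends with "return $app;".
$app = require __DIR__.'/bootstrap.php';

$console = new Application('My Silex Application', '0.1');

// Register a task; the closure receives the console input and output,
// and pulls in $app so that Silex services are available inside the task.
$console
    ->register('hello')
    ->setDescription('Hypothetical placeholder task')
    ->setCode(function (InputInterface $input, OutputInterface $output) use ($app) {
        $output->writeln('Hello from the console');
    });

$console->run();
```

With this in place, `php console.php hello` would run the task and `php console.php list` would show all registered tasks.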
Following his instructions, my project structure now looks like this.
This was only a very small change to my existing pet project.
What was left was to find a way to integrate Goutte smoothly so I could use it in a task.
First, I downloaded goutte.phar to src/goutte.phar.
Second, I created a custom Silex service provider for Goutte in src/Caefer/GoutteServiceProvider.php.
I didn’t use the autoload facility of the Goutte phar file, as it would have thrown an error for declaring UniversalClassLoader a second time. That’s why I had to register the namespace for the Zend library manually in the provider.
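A minimal sketch of what such a provider could look like follows. It assumes an early Silex build that exposes its UniversalClassLoader as `$app['autoloader']` and supports `$app->share()`. The directory names inside the phar (`src` and `vendor/zend/library`) and the optional `goutte.phar` parameter are assumptions for illustration, so check them against your own copy of the phar.

```php
<?php
// src/Caefer/GoutteServiceProvider.php (sketch under the assumptions above)

namespace Caefer;

use Silex\Application;
use Silex\ServiceProviderInterface;

class GoutteServiceProvider implements ServiceProviderInterface
{
    public function register(Application $app)
    {
        // Hypothetical parameter: lets the phar location be overridden.
        $phar = isset($app['goutte.phar'])
            ? $app['goutte.phar']
            : __DIR__.'/../goutte.phar';

        // Reuse the class loader Silex already created instead of the phar's
        // own autoloader, which would declare UniversalClassLoader again.
        // Assumption: these are the library directories inside the phar.
        $app['autoloader']->registerNamespaces(array(
            'Goutte' => 'phar://'.$phar.'/src',
            'Zend'   => 'phar://'.$phar.'/vendor/zend/library',
        ));

        // Shared service: the same client instance on every access.
        $app['goutte'] = $app->share(function () {
            return new \Goutte\Client();
        });
    }

    // Only needed for Silex versions whose interface declares boot().
    public function boot(Application $app)
    {
    }
}
```

Registering the client as a shared service means cookies and history are kept across requests made within one task.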
Last, I modified bootstrap.php so that it knows my Caefer namespace and registers the service provider.
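The change to bootstrap.php might look something like this sketch. It assumes bootstrap.php lives in the project root next to src/, that Silex is loaded from a silex.phar in the same directory (a hypothetical location), and that this Silex version exposes its class loader as `$app['autoloader']`.

```php
<?php
// bootstrap.php (sketch under the assumptions above)

// Assumption: silex.phar sits next to this file.
require_once __DIR__.'/silex.phar';

$app = new Silex\Application();

// Teach the existing class loader where classes in the Caefer namespace live,
// so that Caefer\GoutteServiceProvider resolves to src/Caefer/GoutteServiceProvider.php.
$app['autoloader']->registerNamespace('Caefer', __DIR__.'/src');

// Register the provider; afterwards $app['goutte'] is available as a service.
$app->register(new Caefer\GoutteServiceProvider());

// Returning the application lets console.php (and the web front controller) reuse it.
return $app;
```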
Now in my console.php I can use the Goutte client as follows.
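A sketch of such a task is shown below. The `scrape` task name, the `url` argument and the choice to print every link are hypothetical stand-ins for whatever your app actually needs. It assumes bootstrap.php returns the application with a `goutte` service registered, and that the Symfony Console, DomCrawler and CssSelector components can be autoloaded.

```php
<?php
// console.php: sketch of a task that uses the Goutte client.

use Symfony\Component\Console\Application;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

// Assumption: bootstrap.php ends with "return $app;".
$app = require __DIR__.'/bootstrap.php';

$console = new Application('Scraper', '0.1');

$console
    ->register('scrape')
    ->setDefinition(array(
        new InputArgument('url', InputArgument::REQUIRED, 'Page to fetch'),
    ))
    ->setDescription('Fetch a page and list the links found on it')
    ->setCode(function (InputInterface $input, OutputInterface $output) use ($app) {
        // request() performs the HTTP call and returns a DomCrawler instance.
        $crawler = $app['goutte']->request('GET', $input->getArgument('url'));

        // filter() takes a CSS selector; iterating yields DOM elements.
        foreach ($crawler->filter('a') as $link) {
            $output->writeln(sprintf(
                '%s -> %s',
                trim($link->textContent),
                $link->getAttribute('href')
            ));
        }
    });

$console->run();
```

It would be invoked as `php console.php scrape http://example.com/`. Iterating the crawler with `foreach` and reading `textContent` keeps the sketch independent of how a particular DomCrawler version implements `each()`.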