2009-06-22

Web scraping with Python for fun and profit

The web is everywhere, we know. It is also used more and more to present information to a wide audience. Sadly, it is often the only way data is made available...

Still, we need to get at that data. The process of extracting information from web pages is known as web scraping, and note that it's a very fragile process: every time the web page changes, you'll likely have to modify the code that parses it.

Probably the most famous Python module for web scraping is BeautifulSoup. While it may be fine for simple web pages, I found it really hard to get anything done with more complex ones, in particular those with embedded JavaScript.

Thanks to Ian's blog post, I discovered how nice it is to use lxml for web scraping, in particular together with the Firebug Firefox add-on. It's a simple process:
  1. take the page;
  2. generate the lxml tree;
  3. with Firebug find the XPath to the element you need;
  4. loop / parse / have fun :)
If you find yourself needing to scrape a page, give lxml a try: you'll be surprised and satisfied!
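The four steps above can be sketched like this (the HTML snippet, the element names and the XPath expression are made-up examples; in real use you'd download the page first, e.g. with urllib, instead of using an inline string):

```python
from lxml import html

# 1. take the page -- here an inline snippet stands in for a downloaded page
page = """
<html><body>
  <div id="content">
    <h2> First post </h2>
    <h2> Second post </h2>
  </div>
</body></html>
"""

# 2. generate the lxml tree
tree = html.fromstring(page)

# 3. an XPath you might have found with Firebug (hypothetical)
titles = [t.strip() for t in tree.xpath("//div[@id='content']//h2/text()")]

# 4. loop / parse / have fun :)
for title in titles:
    print(title)
```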

6 comments:

jldugger said...

I've recently looked into this, and my own challenge has been HTML that isn't valid, even after Tidy. This is the sort of thing that Beautiful Soup is supposed to help with, no?

vasi said...

I've had much success using hpricot and ruby.

Sandro Tosi said...

@jldugger: nowadays, more and more web pages are automatically generated by a CMS, so invalid HTML (or XML in general) documents are rare, and I've never run into any.

@vasi: but that would fail the "python" assumption in the subject ;)

Olivier Berger said...

Maybe, if your thoughts come from your experience working on bts-link, I can see clearly what you mean :-)

My idea is that someday, the apps will speak Semantic Web standards like RDF(a) and provide ways to extract data without scraping.

At least for bugtrackers, that's what we're after: http://www-public.it-sudparis.eu/~berger_o/weblog/tag/semantic-web/

Sandro Tosi said...

@Olivier: well, no: it comes from "paid work" needs :)

My boss wanted a script to parse the tables on some web pages and generate an Excel sheet. With BeautifulSoup I tried and tried without success, while with lxml I got it done on the first try.
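For the curious, the table-extraction half of such a script might look roughly like this (the table content is invented, and the Excel-writing step, e.g. with a module like xlwt, is left out):

```python
from lxml import html

# stand-in for a fetched page; real use would download it first
page = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""
tree = html.fromstring(page)

# one list per table row, with the text of each cell
rows = [
    [cell.text_content().strip() for cell in row.xpath("./th|./td")]
    for row in tree.xpath("//tr")
]
# 'rows' can then be fed to a spreadsheet writer to build the Excel sheet
```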

Tucanae Services said...

Add PyQuery on top of lxml and your ability to scrape increases yet again. If I have to scrape, that is the combination of tools I use.