Simple Web Scraping
Taking a work-related break from admin topics tonight and tomorrow night. So instead I'd like to introduce simple web scraping, which I have done my fair share of.
In a perfect world, all websites would offer easy integration, whether via a formalized API, easily parsed documents or feeds in formats like Atom/RSS. We do not live in that world, though many modern websites are beginning to provide these pieces. Unfortunately, many websites (such as government ones) have interesting data but only provide it as HTML. Fortunately, there are a couple of Python modules that can help us get to this data and make our lives a whole lot easier.
First, let's grab the modules we need.
On Your Mark - httplib2
I used to be one of those "urllib is good enough for me" types. But after being introduced to httplib2, I am spoiled. If urllib is a pocket knife, httplib2 is a death ray living on an orbital space station. Perhaps that's a bit sensationalist, but httplib2 really is awesome. Happy bits include timeouts, HTTPS, authentication, compression and more.
As it is not part of the Python standard library, you can install it via easy_install/pip or check it out via Subversion.
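For the curious, here is a minimal sketch of what a request looks like (the URL is just a placeholder):

import httplib2

# Create a client with a 10 second timeout; passing a directory name
# (e.g. httplib2.Http('.cache')) would also give you response caching.
http = httplib2.Http(timeout=10)

# request() returns a (response, content) pair; the response acts like a
# dictionary of headers and also carries the status code.
response, content = http.request("http://example.com/", "GET")
print response.status    # e.g. 200
print len(content)       # the raw body as a string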
Now that we've got the library for transporting the content, let's grab the module we'll use to parse.
Get Set - BeautifulSoup
At first, you might be tempted to use an XML parser, thinking you may only have to work around a couple of cases. Hold off that urge and look into BeautifulSoup instead. BeautifulSoup handles most XHTML/HTML with ease, even when the markup is awful.
BeautifulSoup is as easy to install as httplib2, again either via easy_install/pip or through BeautifulSoup's website.
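As a quick illustration of how forgiving it is, here's a tiny, made-up bit of sloppy markup run through the parser:

from BeautifulSoup import BeautifulSoup

# Unclosed tags and all -- BeautifulSoup still builds a usable parse tree.
messy = "<html><body><p>First paragraph<p>Second <b>bold text"
soup = BeautifulSoup(messy)

print soup.find('b').contents[0]    # the text inside the <b> tag
for p in soup.findAll('p'):
    print p.contents[0]             # the leading text of each <p>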
We now have the components we need. Let's get parsing.
Scrape!
For an example, let's say you want a list of popular/anticipated DVD releases for the week. The website VideoETA features a very comprehensive list of new releases each week. However, their RSS feed is only news, not the releases themselves. So we'll scrape their listings for what we need. Here's the code:
#!/usr/bin/env python
import httplib2

from BeautifulSoup import BeautifulSoup


class PageNotFound(RuntimeError):
    pass


class DVDReleaseScraper(object):
    def __init__(self, url="http://videoeta.com/week_video.html", timeout=15):
        """
        Sets up a DVDReleaseScraper instance.

        url should be a string pointing to the VideoETA week release URL.
        timeout should be an integer for the number of seconds to wait for a response.
        """
        self.url, self.timeout = url, int(timeout)

    def fetch_releases(self):
        """
        Fetches the page of DVD releases.

        Raises PageNotFound if the page could not be successfully retrieved.
        """
        http = httplib2.Http(timeout=self.timeout)
        headers, content = http.request(self.url)

        if headers.get('status') != '200':
            raise PageNotFound("Could not fetch listings from '%s'. Got %s." % (self.url, headers['status']))

        return content

    def parse_releases(self, content):
        """Parse the page and return the releases."""
        soup = BeautifulSoup(content)
        releases = []
        raw_releases = soup.findAll('table')[1].find('td').findAll('p')

        # Skip the first and last paragraphs, as they are navigation.
        for raw_release in raw_releases[1:-1]:
            if not raw_release.a:
                continue

            release_info = {}
            release_info['title'] = raw_release.a.b.contents[0].strip()
            release_info['rating'] = raw_release.contents[2].strip().replace('(', '').replace(')', '')

            if raw_release.find('blockquote'):
                release_info['synopsis'] = raw_release.find('blockquote').contents[0].strip()
            else:
                release_info['synopsis'] = ''

            releases.append(release_info)

        return releases

    def get_releases(self):
        """A convenience method to fetch and return releases."""
        content = self.fetch_releases()
        return self.parse_releases(content)


if __name__ == '__main__':
    scraper = DVDReleaseScraper()
    releases = scraper.get_releases()

    for x in xrange(0, 5):
        print releases[x]['title']
A good portion of this script is fairly straightforward. The interesting bits are the fetch_releases and parse_releases methods. The fetch_releases method simply leverages httplib2, making a standard HTTP request for the URL with a default timeout of 15 seconds. If the page comes back alright (status 200 OK), it returns the content of the page.
The parse_releases method is a little more complex, as this is where we dive into BeautifulSoup. We first load the page content into BeautifulSoup, then use it to pull out the second table it finds, then the first td within that, then all of the p tags within that. From there, we iterate through the paragraphs. If one contains a link (an a tag), we parse it further, grabbing the title, the rating and, if present, the synopsis. We strip each of these because they may contain whitespace we probably don't want. Finally, we store each dictionary in our releases list and return that list when we're done.
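To make that chain of calls a bit more concrete, here is the same navigation applied to a stripped-down, made-up fragment shaped roughly like the listings page (the real markup is messier):

from BeautifulSoup import BeautifulSoup

# A made-up fragment, shaped loosely like the VideoETA listings page.
html = """
<table><tr><td>Navigation</td></tr></table>
<table><tr><td>
  <p>Week of January 1</p>
  <p><a href="/1"><b>Some Movie</b></a> (PG-13)
     <blockquote>A made-up synopsis.</blockquote></p>
  <p>Back to top</p>
</td></tr></table>
"""

soup = BeautifulSoup(html)
# Second table -> first td -> all paragraphs, just as in parse_releases.
paragraphs = soup.findAll('table')[1].find('td').findAll('p')
release = paragraphs[1]                  # skip the navigation paragraph
print release.a.b.contents[0].strip()                  # Some Movie
print release.find('blockquote').contents[0].strip()   # A made-up synopsis.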
The result is a Python list of dictionaries, with each dictionary containing the title, rating and (optionally) the synopsis, ready for our use elsewhere! Our scraping is complete.
Notes About Scraping
First, be aware that scraping is notoriously fragile, as it depends largely on the exact layout of tags in the content. Since most pages don't change much over time, this usually isn't a huge problem. However, be wary of redesigns and write code to handle failures (the example code could have been better in this regard, at the expense of easily seeing how to use BeautifulSoup).
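For instance, a caller should be prepared for both a failed fetch and a layout change; something along these lines (a sketch, not production code):

scraper = DVDReleaseScraper()

try:
    releases = scraper.get_releases()
except PageNotFound:
    releases = []   # the site was unreachable or returned an error status
except (AttributeError, IndexError):
    releases = []   # the markup changed and our navigation assumptions broke

if not releases:
    print "No releases found; time to check the source page."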
Also, if you're scraping, store/cache the data on your end as much as possible. If you're integrating the data into a website of your own, you should avoid hitting someone else's site on every page request. You'll save network overhead, parsing time and CPU cycles, and you'll go easier on their site as well; it's a courtesy you should extend to anyone providing the data. Either store the scraped data in a database or, at the bare minimum, pickle it for your use.
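A minimal sketch of the pickle route (the helper, file name and refresh interval are all arbitrary):

import os
import pickle
import time

CACHE_FILE = 'releases.pickle'    # arbitrary location
MAX_AGE = 60 * 60 * 24            # re-scrape at most once a day

def get_cached_releases(scraper):
    """Return cached releases if they are fresh enough, otherwise re-scrape."""
    if os.path.exists(CACHE_FILE) and time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE:
        return pickle.load(open(CACHE_FILE, 'rb'))

    releases = scraper.get_releases()
    pickle.dump(releases, open(CACHE_FILE, 'wb'))
    return releases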