Recursively Scraping A Blog With Scrapy

Scrapy is a web crawling and scraping framework
written in Python. The framework is simple to understand and easy
to get started with. If you know a little bit of Python, you should be
able to build your own web scraper within a few minutes. In this post I’m
going to describe how you can use Scrapy to build a recursive blog
crawler. Building a recursive scraper with Scrapy is pretty simple, yet
its getting-started guide doesn’t help people who are unfamiliar with
the framework write a recursive scraper.

I’m going to use blog.scrapy.org as the target
blog, and the techniques I discuss will work on other sites as well
with simple changes to the information-extraction queries. If you are new to
Scrapy, you should read the Scrapy tutorial first.
I also assume that you have Scrapy installed on your machine.

Let’s create a Scrapy project first using the following command:

scrapy startproject scrapy_sample
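
This creates a project skeleton for you. With the Scrapy versions this post was written against, the layout looks roughly like the following (file names may differ slightly between versions):

scrapy_sample/
    scrapy.cfg
    scrapy_sample/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py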

Defining the Scrapy item #

The next step is to define the Item, which is the container the Scrapy spider
uses to store the scraped data. I’m going to extract the blog post link,
the post title and the text content of the post, so my Item definition will
look like the following.

from scrapy.item import Item, Field

class ScrapySampleItem(Item):
    title = Field()
    link = Field()
    content = Field()
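
Items behave much like Python dictionaries, so once the class is defined you can assign and read fields with the usual dict syntax. A minimal sketch (the values here are just placeholders, not real scraped data):

from scrapy_sample.items import ScrapySampleItem

item = ScrapySampleItem()
item["title"] = "Example post title"             # placeholder value
item["link"] = "http://blog.scrapy.org/example"  # placeholder value
print(item["title"])                             # fields are accessed like dict keys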

Implementing the spider #

Our spider defines the initial URL to download content from, how to
follow pagination links, and how to extract the blog posts from a page and
create Items from them.

Your spider class must be a subclass of scrapy.spider.BaseSpider, and
you need to define three main mandatory attributes: name, start_urls and the
parse() method.

Our spider implementation will look like the following:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy_sample.items import ScrapySampleItem

class ScrapyOrgSpider(BaseSpider):
    name = "scrapy"
    allowed_domains = ["scrapy.org"]
    start_urls = ["http://blog.scrapy.org/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # Follow the pagination link to the next page, if one exists.
        # The last page has no next-page link, so guard against an empty result.
        next_page = hxs.select(
            "//div[@class='pagination']/a[@class='next_page']/@href").extract()
        if next_page:
            yield Request(next_page[0], self.parse)

        # Extract every blog post on the current page and build an Item for each.
        posts = hxs.select("//div[@class='post']")
        items = []
        for post in posts:
            item = ScrapySampleItem()
            item["title"] = post.select("div[@class='bodytext']/h2/a/text()").extract()
            item["link"] = post.select("div[@class='bodytext']/h2/a/@href").extract()
            item["content"] = post.select("div[@class='bodytext']/p/text()").extract()
            items.append(item)
        for item in items:
            yield item

First we create an HtmlXPathSelector from the response object. This allows us to select elements in the response HTML using XPath selectors.

Then we extract the link to the next page of the blog using the “//div[@class='pagination']/a[@class='next_page']/@href” XPath expression; the selector you need in your own code will depend on the site you are crawling. Once we get the URL of the next page, we check whether the list returned by the selector actually contains a URL, because the last page will not have a next-page link and Scrapy will throw an error if we try to follow an empty URL. The main trick here is that we return a Python generator for the recursive call; you can learn more about the reason behind this from this Stack Overflow conversation.

The last thing we do inside the parse method is extract the blog posts on the current page and create a list of Scrapy Items from them.

Running the scraper #

Now you can execute your scraper by running the following command from
the root directory of your Scrapy project.

scrapy crawl scrapy

Scrapy also allows you to save the scraped items to a JSON-formatted file.
All you have to do is add the -o filename.json -t json options to the previous
crawl command, and the scraped items will be saved to a JSON file with
the given name.
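
For example (the output file name here is just an illustration):

scrapy crawl scrapy -o items.json -t json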

You can find more information about Scrapy here. I strongly recommend reading the full documentation if you’d like to dig deeper into Scrapy.

Source code for the sample can be found here.

This post was moved from my old blog.
