Dev Series
Scrapy 101: Architecture and Lifecycle
Scrapy (/ˈskreɪpaɪ/) is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.
The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system.
Spiders
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL listed in start_urls, with the parse method as the callback for those Requests.
In the callback function, you parse the response (web page) and return item objects, Request objects, or an iterable of these objects. Those Requests also carry a callback (possibly the same one); Scrapy downloads them and passes each response to the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml, or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider are typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
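The cycle above can be sketched with a toy model in plain Python. This is not Scrapy's engine and uses none of its API: the Request and Response classes, the FAKE_SITE dictionary, and the crawl() loop are all hypothetical stand-ins, written only to show how a callback's yielded Requests go back into the queue while yielded items are collected.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a website: URL -> (quotes on the page, next URL).
FAKE_SITE = {
    "/page/1": (["quote A", "quote B"], "/page/2"),
    "/page/2": (["quote C"], None),
}


@dataclass
class Request:
    """Toy request: a URL plus the callback that will handle its response."""
    url: str
    callback: Callable


@dataclass
class Response:
    """Toy response: the 'downloaded' content for a URL."""
    url: str
    quotes: list
    next_url: Optional[str]


def parse(response):
    """Callback: yield one item per quote, then a Request for the next page."""
    for text in response.quotes:
        yield {"text": text, "page": response.url}
    if response.next_url is not None:
        yield Request(response.next_url, callback=parse)


def crawl(start_urls):
    """Run the cycle: initial requests, download, callback, items or requests."""
    queue = deque(Request(url, callback=parse) for url in start_urls)
    items = []
    while queue:
        request = queue.popleft()
        quotes, next_url = FAKE_SITE[request.url]  # stands in for the download
        response = Response(request.url, quotes, next_url)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)  # follow-up request goes back to the queue
            else:
                items.append(result)  # an item would go to a pipeline or feed
    return items


items = crawl(["/page/1"])
```

Here the queue is processed one request at a time for simplicity; the real framework schedules and downloads requests concurrently.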
Walk-through of an example spider
In order to show you what Scrapy brings to the table, we’ll walk you through an example of a Scrapy Spider using the simplest way to run a spider.
Here’s the code for a spider that scrapes famous quotes from the website https://quotes.toscrape.com, following the pagination:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Yield one dict per quote block found on the page.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:
scrapy runspider quotes_spider.py -o quotes.jl
When this finishes, the quotes.jl file will contain a list of the quotes in JSON Lines format, each with text and author fields, looking like this:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
…
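Since each line of a JSON Lines file is a standalone JSON object, the output can be loaded with the standard library alone. The following is a minimal sketch; the read_json_lines helper is a hypothetical name, and the sample uses an in-memory buffer holding two of the records shown above rather than a real file.

```python
import io
import json


def read_json_lines(lines):
    """Parse an iterable of JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# In-memory sample in the same shape as the spider's output.
sample = io.StringIO(
    '{"author": "Steve Martin", "text": "\\u201cA day without sunshine '
    'is like, you know, night.\\u201d"}\n'
    '{"author": "Jane Austen", "text": "\\u201cThe person, be it gentleman '
    'or lady, who has not pleasure in a good novel, must be intolerably '
    'stupid.\\u201d"}\n'
)

quotes = read_json_lines(sample)
authors = [quote["author"] for quote in quotes]
```

With a real file, the same helper can be passed an open file object, for example one obtained from open("quotes.jl", encoding="utf-8"), because iterating over a text file yields its lines.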
When you ran the command scrapy runspider quotes_spider.py, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.
The crawl started by making requests to the URLs defined in the start_urls attribute (in this case, only the URL for quotes in the humor category) and called the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page and schedule another request using the same parse method as a callback.
Here you notice one of the main advantages of Scrapy: requests are scheduled and processed asynchronously. This means that Scrapy doesn't need to wait for a request to be finished and processed; it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.
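The idea of asynchronous scheduling can be illustrated with a small standalone sketch. Scrapy itself is built on Twisted, so this is not how the framework is implemented; the example uses Python's asyncio only to show the behaviour, and the fetch() coroutine, its delays, and the URLs are hypothetical stand-ins for real downloads.

```python
import asyncio


async def fetch(url, delay, fail=False):
    """Hypothetical download: sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"download failed: {url}")
    return f"response for {url}"


async def crawl():
    # All three 'requests' are in flight at once; none waits for another.
    tasks = [
        fetch("/page/1", 0.03),
        fetch("/page/2", 0.01, fail=True),
        fetch("/page/3", 0.02),
    ]
    # return_exceptions=True returns a failure as a value instead of raising,
    # so the results of the other requests are still collected.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(crawl())
succeeded = [r for r in results if not isinstance(r, Exception)]
```

The failed request for /page/2 shows up as an exception object in the results, while the other two complete normally.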