Scrapy is a robust, high-performance web scraping framework for Python. It simplifies extracting structured data from websites, making it an essential tool for data scientists and developers alike. As an open-source framework, it provides a complete suite of tools for efficient web crawling and scraping, and it is designed to handle large-scale projects with ease, offering features such as concurrent request handling and data extraction pipelines.
To create a simple spider, you need to define a class that inherits from `scrapy.Spider`. Here's a basic example:

```python
import scrapy


class SimpleSpider(scrapy.Spider):
    name = 'simple_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
```
This spider crawls the specified URL and extracts the title and first h1 tag from the page.
To run a spider, use the following command in your terminal:

```shell
scrapy runspider simple_spider.py
```
Scrapy uses CSS and XPath selectors to extract data. Here's an example of using both:

```python
# CSS selector
title = response.css('title::text').get()

# XPath selector
h1 = response.xpath('//h1/text()').get()
```
For websites with multiple pages, you can implement pagination in your spider:

```python
def parse(self, response):
    # Extract data from the current page
    for item in response.css('div.item'):
        yield {
            'name': item.css('span.name::text').get(),
            'price': item.css('span.price::text').get(),
        }

    # Follow the pagination link
    next_page = response.css('a.next-page::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, self.parse)
```
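One convenience of `response.follow` is that it accepts relative URLs and resolves them against the current page, much as the standard library's `urllib.parse.urljoin` does. A small sketch of that resolution, using hypothetical URLs:

```python
from urllib.parse import urljoin

# Current page (made up) and a relative pagination link, such as
# might come back from response.css('a.next-page::attr(href)').
current_page = "http://example.com/products/page-1"
next_page = "page-2"

# response.follow performs a resolution like this internally.
absolute = urljoin(current_page, next_page)
print(absolute)  # http://example.com/products/page-2
```

This is why the spider above can yield `response.follow(next_page, self.parse)` directly, without manually building the absolute URL.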
As you become more comfortable with Scrapy, explore advanced topics such as item pipelines, middlewares, and custom settings. Scrapy also integrates well with other Python libraries; for more complex parsing tasks, consider combining it with BeautifulSoup for additional parsing capabilities.
Scrapy provides a powerful framework for web scraping in Python. By mastering its basics, you'll be well-equipped to handle a wide range of web scraping projects efficiently. Remember to always scrape responsibly and ethically, respecting website policies and server resources.
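On the responsible-scraping point: Scrapy can honor robots.txt for you via its `ROBOTSTXT_OBEY` setting, and the underlying idea is easy to see with the standard library's `urllib.robotparser`. A minimal sketch with a made-up robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows one section of the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a generic crawler may fetch each URL.
print(parser.can_fetch("*", "http://example.com/products"))   # True
print(parser.can_fetch("*", "http://example.com/private/x"))  # False
```

Checking these rules before crawling, and throttling request rates, goes a long way toward respecting website policies and server resources.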