20 Scrapy Concepts with Before-and-After Examples
1. Creating a Scrapy Project 📁
Boilerplate Code:
scrapy startproject myproject
Use Case: Initialize a new Scrapy project. 📁
Goal: Set up the basic structure for your Scrapy project. 🎯
Sample Command:
scrapy startproject myproject
Before Example:
You need to scrape data but don’t have a project structure. 🤔
No project directory.
After Example:
With scrapy startproject, you get a fully scaffolded project directory! 📁
myproject/
├── myproject/
├── scrapy.cfg
└── ...
Challenge: 🌟 Try creating a second Scrapy project under a different name and compare the generated files (settings.py, items.py, pipelines.py, middlewares.py) between the two.
2. Creating a Spider 🕷️
Boilerplate Code:
scrapy genspider spider_name domain.com
Use Case: Create a Spider to scrape a specific website. 🕷️
Goal: Set up a spider that defines how to crawl and parse a website. 🎯
Sample Command:
scrapy genspider myspider example.com
Before Example:
You want to scrape a website but don’t have a spider defined. 🤔
No spider available.
After Example:
With scrapy genspider, you generate a spider file ready to customize! 🕷️
myproject/spiders/myspider.py
Challenge: 🌟 Try creating spiders for multiple domains and define the rules for each.
3. Running a Spider 🏃‍♂️
Boilerplate Code:
scrapy crawl spider_name
Use Case: Use crawl to run your Scrapy spider. 🏃‍♂️
Goal: Execute the spider to crawl and scrape data from the target site. 🎯
Sample Command:
scrapy crawl myspider
Before Example:
You’ve written your spider but don’t know how to execute it. 🤔
Spider exists, but no data collected.
After Example:
With scrapy crawl, the spider runs, scrapes, and collects data! 🏃‍♂️
Data is collected and printed or stored.
Challenge: 🌟 Run the spider with the -o option to save scraped data into a file (e.g., json, csv).
4. Parsing Responses (parse method) 🔍
Boilerplate Code:
def parse(self, response):
    # Extract data here
    pass
Use Case: Define the parse method to handle the data extracted from responses. 🔍
Goal: Extract data from the HTML content of the page. 🎯
Sample Code:
def parse(self, response):
    title = response.css('title::text').get()
    yield {'title': title}
Before Example:
You have a spider that crawls pages but doesn’t extract specific data. 🤔
HTML response is received but no data extracted.
After Example:
With parse, you extract specific elements from the page! 🔍
Extracted data: {"title": "Example Title"}
Challenge: 🌟 Try extracting multiple fields like headers, paragraphs, or links using CSS or XPath selectors.
5. CSS Selectors (response.css) 🌐
Boilerplate Code:
response.css('css_selector')
Use Case: Use CSS selectors to locate elements within the HTML response. 🌐
Goal: Select and extract data using CSS-like syntax. 🎯
Sample Code:
title = response.css('title::text').get()
Before Example:
You have an HTML response but can’t efficiently extract specific elements. 🤔
Data: <title>Example Title</title>
After Example:
With CSS selectors, you can easily extract the desired text or attributes! 🌐
Output: "Example Title"
Challenge: 🌟 Use CSS selectors to extract different elements such as images (img::attr(src)), links (a::attr(href)), or text.
6. XPath Selectors (response.xpath) 🧭
Boilerplate Code:
response.xpath('xpath_expression')
Use Case: Use XPath selectors to extract elements from the HTML response. 🧭
Goal: Use powerful XPath expressions for more flexible or complex queries. 🎯
Sample Code:
title = response.xpath('//title/text()').get()
Before Example:
You need to extract elements but CSS selectors are not flexible enough. 🤔
Data: <title>Example Title</title>
After Example:
With XPath, you can extract data using more complex queries! 🧭
Output: "Example Title"
Challenge: 🌟 Try using XPath to extract nested elements or multiple attributes in a single query.
7. Extracting Links (response.follow) 🔗
Boilerplate Code:
response.follow(link, callback)
Use Case: Use follow to navigate to links and scrape multiple pages. 🔗
Goal: Extract links from a page and follow them to scrape additional pages. 🎯
Sample Code:
for href in response.css('a::attr(href)').getall():
    yield response.follow(href, self.parse)
Before Example:
Your spider scrapes a single page but doesn’t navigate to other linked pages. 🤔
Only the first page is scraped.
After Example:
With response.follow, you can follow links and scrape multiple pages! 🔗
The spider navigates and scrapes linked pages.
Challenge: 🌟 Try following only specific links, such as those that contain certain keywords or paths.
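One way to follow only specific links is to filter the extracted hrefs before handing them to response.follow. The helper below is a plain-Python sketch using only the standard library; the function name select_links and the "/blog/" prefix are hypothetical. Inside a spider, hrefs would come from response.css('a::attr(href)').getall() and base_url from response.url.

```python
from urllib.parse import urljoin, urlparse


def select_links(base_url, hrefs, allowed_path_prefix="/blog/"):
    """Return absolute same-site URLs whose path starts with the given prefix."""
    base_host = urlparse(base_url).netloc
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.netloc == base_host and parts.path.startswith(allowed_path_prefix):
            selected.append(absolute)
    return selected


hrefs = ["/blog/post-1", "/about", "https://other.example.org/blog/x", "post-2"]
links = select_links("https://example.com/blog/", hrefs)
print(links)
```

Each URL that survives the filter can then be yielded as response.follow(url, self.parse).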
8. Storing Data (Item Pipeline) 📊
Boilerplate Code:
class MyItemPipeline:
    def process_item(self, item, spider):
        # Process and store the item
        return item
Use Case: Use item pipelines to store or process the scraped data. 📊
Goal: Define how scraped data should be processed and stored after extraction. 🎯
Sample Code:
class MyItemPipeline:
    def process_item(self, item, spider):
        # Save item to a file or database
        with open('output.txt', 'a') as f:
            f.write(f"{item}\n")
        return item
Before Example:
You’ve extracted data but have no way to store or process it. 🤔
Scraped data is printed but not saved.
After Example:
With pipelines, you can process and store data in files, databases, etc.! 📊
Output: Data is saved to a file or database.
Challenge: 🌟 Try implementing pipelines to save data in formats like CSV or JSON.
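As a possible starting point for this challenge, here is a sketch of a pipeline that writes one JSON object per line. It uses the open_spider and close_spider hooks so the file is opened once per crawl instead of once per item. The class name, the path argument, and the default file name are assumptions, and dict(item) assumes items are dicts or Item objects.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write each item as one JSON object per line (hypothetical example class)."""

    def __init__(self, path="items.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline


# Exercise the pipeline by hand, outside of a crawl.
path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Example Title"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

A pipeline only runs during a crawl once it is listed in the ITEM_PIPELINES setting, for example {"myproject.pipelines.JsonLinesPipeline": 300}, where the module path is an assumption about your project layout.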
9. Defining Items (Item Class) 📋
Boilerplate Code:
from scrapy import Item, Field
Use Case: Define a structured Item to represent the data you are scraping. 📋
Goal: Organize the scraped data into a structured format. 🎯
Sample Code:
class MyItem(Item):
    title = Field()
    link = Field()
Before Example:
You’ve scraped data but don’t have a structured format to represent it. 🤔
Unstructured data extraction.
After Example:
With Item, your data is organized into fields for better structure and processing! 📋
Structured data: {"title": "Example", "link": "https://example.com"}
Challenge: 🌟 Try defining multiple fields and extract values for each one using CSS or XPath.
10. Handling Pagination (next page) 🔄
Boilerplate Code:
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, self.parse)
Use Case: Handle pagination to scrape data across multiple pages. 🔄
Goal: Automatically navigate through paginated content to collect more data. 🎯
Sample Code:
def parse(self, response):
    # Extract data from the current page
    yield {'title': response.css('title::text').get()}
    # Follow the pagination link
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
Before Example:
Your spider scrapes only the first page of a paginated website. 🤔
Data is limited to the first page.
After Example:
With pagination handling, the spider follows links and scrapes additional pages! 🔄
Data collected from multiple pages.
Challenge: 🌟 Try handling pagination where the "next" button has different forms (e.g., buttons, JavaScript events).
11. Configuring Settings (Settings Module) ⚙️
Boilerplate Code:
from scrapy.utils.project import get_project_settings
Use Case: Use the settings module to configure how Scrapy runs. ⚙️
Goal: Adjust settings like user-agent, download delays, and more. 🎯
Sample Code:
settings = get_project_settings()
settings.set('USER_AGENT', 'Mozilla/5.0 (compatible; MyScrapyBot/1.0)')
# The change applies only to crawlers created from this object, e.g. CrawlerProcess(settings)
Before Example:
Your spider runs with Scrapy's default settings, including a default user-agent that identifies it as Scrapy, so some sites may block it. 🤔
Scrapy default settings in use.
After Example:
With custom settings, you can fine-tune spider behavior like user-agent and download delays! ⚙️
Custom user-agent or settings applied.
Challenge: 🌟 Try adding download delays to prevent being blocked by websites (DOWNLOAD_DELAY = 2).
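Inside a generated project, settings are usually written as plain assignments in settings.py rather than through settings.set. A minimal fragment for this challenge might look like the following; the user-agent value is the one from the sample above, and the delay is only an example.

```python
# settings.py (fragment)

# Identify the crawler with a custom user-agent string.
USER_AGENT = "Mozilla/5.0 (compatible; MyScrapyBot/1.0)"

# Wait about 2 seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# On by default: the actual wait is a random value between
# 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```

The same keys can also be set for a single spider through its custom_settings class attribute.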
12. Handling Cookies (COOKIES_ENABLED) 🍪
Boilerplate Code:
settings.set('COOKIES_ENABLED', True)
Use Case: Enable or disable cookies in your Scrapy project. 🍪
Goal: Control how your spider handles cookies for session-based scraping. 🎯
Sample Code:
settings.set('COOKIES_ENABLED', True)
Before Example:
Your spider struggles to maintain a session because cookies have been turned off (COOKIES_ENABLED = False; Scrapy enables them by default). 🤔
Session information is lost.
After Example:
With cookies enabled, your spider maintains sessions correctly across requests! 🍪
Session data maintained via cookies.
Challenge: 🌟 Try scraping a website that requires login using cookies to maintain the session.
13. Customizing Request Headers (headers) 📜
Boilerplate Code:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en'
}
yield scrapy.Request(url, headers=headers)
Use Case: Customize headers in your requests to mimic real browser behavior. 📜
Goal: Avoid detection by websites and mimic genuine users. 🎯
Sample Code:
# Send a request with custom headers
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en'}
yield scrapy.Request(url="https://example.com", headers=headers)
Before Example:
Your spider is blocked due to a missing or default user-agent. 🤔
Request blocked by server.
After Example:
With custom headers, your spider mimics a real browser request! 📜
Request accepted with custom headers.
Challenge: 🌟 Experiment with different headers like Referer and Accept-Encoding to bypass bot detection.
14. Downloading Files (media) 📂
Boilerplate Code:
yield scrapy.Request(url, callback=self.save_file)
Use Case: Use Scrapy to download files like images or PDFs from the web. 📂
Goal: Automate the process of downloading media files from web pages. 🎯
Sample Code:
def save_file(self, response):
    # Use the last URL segment as the filename; fall back if the URL ends with "/"
    filename = response.url.split("/")[-1] or 'download'
    with open(filename, 'wb') as f:
        f.write(response.body)
Before Example:
You manually download files, which is time-consuming. 🤔
Files manually downloaded.
After Example:
With Scrapy, files are automatically downloaded and saved! 📂
Files automatically saved to your system.
Challenge: 🌟 Try downloading multiple file types (e.g., images, PDFs, audio) from a website.
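When downloading many file types, the filename logic tends to be the fragile part: query strings, URL-encoded characters, and URLs ending in "/" all break a plain split("/"). The helper below is a standard-library sketch of a more careful version; the function name and the default filename are assumptions.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Derive a local filename from a URL, ignoring query string and fragment."""
    path = urlparse(url).path                  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))   # last path segment, URL-decoded
    return name or default                     # URLs ending in "/" have no name


print(filename_from_url("https://example.com/files/report.pdf?download=1"))
print(filename_from_url("https://example.com/docs/"))
```

In a callback such as save_file, this would replace response.url.split("/")[-1]. For larger jobs, Scrapy also ships FilesPipeline and ImagesPipeline, which may be a better fit than writing files by hand.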
15. Using CrawlSpider (CrawlSpider Class) 🕸️
Boilerplate Code:
from scrapy.spiders import CrawlSpider, Rule
Use Case: Use CrawlSpider to handle more complex crawling, with automatic link extraction. 🕸️
Goal: Define rules to crawl a website efficiently, automatically following links. 🎯
Sample Code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class MySpider(CrawlSpider):
    name = 'my_crawler'
    start_urls = ['https://example.com']
    # Request links whose URL matches 'category/' and handle each response with parse_item
    rules = [Rule(LinkExtractor(allow=('category/',)), callback='parse_item')]
    def parse_item(self, response):
        # Extract data
        yield {'title': response.css('title::text').get()}
Before Example:
Your spider requires manual coding to follow links and extract data. 🤔
Manually coded link following.
After Example:
With CrawlSpider, link extraction and crawling are automated! 🕸️
Automatic crawling and data extraction based on rules.
Challenge: 🌟 Define multiple rules for different types of links and customize crawling behavior.
16. Throttling Requests (AUTOTHROTTLE) ⏳
Boilerplate Code:
settings.set('AUTOTHROTTLE_ENABLED', True)
Use Case: Enable AutoThrottle to control the speed of requests dynamically. ⏳
Goal: Prevent being blocked by websites by adjusting request rates. 🎯
Sample Code:
settings.set('AUTOTHROTTLE_ENABLED', True)
settings.set('AUTOTHROTTLE_START_DELAY', 1)
settings.set('AUTOTHROTTLE_MAX_DELAY', 10)
Before Example:
Your spider sends too many requests too quickly, getting blocked by websites. 🤔
Website blocks requests due to high volume.
After Example:
With AutoThrottle, your spider automatically adjusts request speed to avoid detection! ⏳
Spider adapts to avoid being blocked.
Challenge: 🌟 Try combining AutoThrottle with a proxy or user-agent rotation to further avoid detection.
17. Handling Redirects (REDIRECT_ENABLED) 🔄
Boilerplate Code:
settings.set('REDIRECT_ENABLED', False)
Use Case: Control how your spider handles redirects (enable/disable). 🔄
Goal: Decide whether to follow redirects or handle them manually. 🎯
Sample Code:
settings.set('REDIRECT_ENABLED', False) # Prevent following redirects
Before Example:
Your spider follows redirects, leading to pages you don't want to scrape. 🤔
Unwanted redirects followed.
After Example:
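With redirects disabled, 3xx responses are no longer followed automatically, so you decide how to handle them! To receive them in a callback, allow their status codes with handle_httpstatus_list. 🔄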
Redirects are not automatically followed.
Challenge: 🌟 Try enabling redirects and handling specific redirects programmatically.
18. Rotating User Agents (fake_useragent) 🔄
Boilerplate Code:
from fake_useragent import UserAgent
Use Case: Rotate user agents to avoid detection by websites. 🔄
Goal: Prevent being blocked by websites that monitor for bots with static user agents. 🎯
Sample Code:
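import scrapy
from fake_useragent import UserAgent
# Method of your spider class
def start_requests(self):
    ua = UserAgent()
    # ua.random returns a randomly chosen user-agent string
    headers = {'User-Agent': ua.random}
    yield scrapy.Request(url='https://example.com', headers=headers)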
Before Example:
You use the same user-agent for all requests, making it easy for websites to detect you as a bot. 🤔
Static user-agent leads to detection.
After Example:
With rotating user agents, you reduce the chance of being detected! 🔄
User-agent rotated for each request.
Challenge: 🌟 Try using multiple user-agent strings and test different websites to see which are most effective.
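If you would rather rotate through your own list than depend on fake_useragent, a small downloader middleware can set the header on every request. This is a sketch under assumptions: the class name is hypothetical, the user-agent strings are placeholders rather than real browser strings, and the stand-in request object is only there so the example runs outside Scrapy.

```python
import itertools
from types import SimpleNamespace


class RotateUserAgentMiddleware:
    """Assign user-agent strings to outgoing requests in round-robin order."""

    # Placeholder values; replace them with the strings you want to test.
    USER_AGENTS = [
        "ExampleBrowser/1.0 (placeholder A)",
        "ExampleBrowser/2.0 (placeholder B)",
        "ExampleBrowser/3.0 (placeholder C)",
    ]

    def __init__(self):
        self._pool = itertools.cycle(self.USER_AGENTS)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = next(self._pool)
        return None  # let Scrapy continue processing the request


# Exercise the middleware with stand-in request objects.
middleware = RotateUserAgentMiddleware()
seen = []
for _ in range(4):
    fake_request = SimpleNamespace(headers={})
    middleware.process_request(fake_request, spider=None)
    seen.append(fake_request.headers["User-Agent"])
print(seen)
```

Like any downloader middleware, it takes effect only after being added to the DOWNLOADER_MIDDLEWARES setting.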
19. Logging (LOG_LEVEL) 📝
Boilerplate Code:
settings.set('LOG_LEVEL', 'INFO')
Use Case: Set the log level to control the verbosity of Scrapy’s logging. 📝
Goal: Adjust the level of logging (e.g., DEBUG, INFO, WARNING, ERROR). 🎯
Sample Code:
settings.set('LOG_LEVEL', 'DEBUG') # Show detailed logging info
Before Example:
Your logs are too verbose or too quiet, making it hard to debug or monitor the spider. 🤔
Irrelevant or missing log data.
After Example:
With log level control, you see only the logs you need! 📝
Logs set to "DEBUG" for detailed information.
Challenge: 🌟 Experiment with different log levels and monitor how your spider behaves in each case.
20. Middleware (Custom Middleware) ⚙️
Boilerplate Code:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Custom request processing logic
        return None
Use Case: Write custom middleware to modify requests or responses before/after they are handled. ⚙️
Goal: Intercept and modify requests or responses dynamically during scraping. 🎯
Sample Code:
class MyCustomMiddleware:
    def process_request(self, request, spider):
        # Add a custom header to all requests
        request.headers['Custom-Header'] = 'MyValue'
        return None
Before Example:
You need to modify requests/responses dynamically, but there’s no built-in feature for your use case. 🤔
Static request handling.
After Example:
With middleware, you can intercept and modify requests or responses as needed! ⚙️
Custom headers added to all requests.
Challenge: 🌟 Try using middleware to retry failed requests or handle custom error conditions.
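A custom middleware does nothing until it is enabled in the project settings, and for plain retries Scrapy's built-in retry middleware may already cover the challenge. The fragment below sketches both; the module path myproject.middlewares is an assumption about where the class lives, and the retry values are examples.

```python
# settings.py (fragment)

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical import path; the number sets the middleware's position
    # in the chain (lower values sit closer to the engine).
    "myproject.middlewares.MyCustomMiddleware": 543,
}

# Built-in retry behaviour for failed requests.
RETRY_ENABLED = True
RETRY_TIMES = 3                               # retries per request, after the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]  # response codes that trigger a retry
```

Custom error conditions that depend on the response body, such as a block page returned with status 200, are a better fit for a process_response method in your own middleware.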