Building a Web Scraper with Proxies in Scrapy: A Step-by-Step Guide

Welcome back, fellow coders!

Today, we’re diving into the world of web scraping with Scrapy, a powerful Python framework. We’ll build a spider that rotates user agents and proxies so it can scrape websites effectively while reducing the chance of being detected and blocked. If you’re new to web scraping or looking to enhance your skills, this post is for you.

Let’s break down the project, file by file, to help you understand how everything works together.

Project Structure

Here’s the file structure for our Scrapy project:

bot/
├── bot/
│   ├── __init__.py
│   ├── settings.py
│   └── spiders/
│       └── proxy_spider.py
├── proxy_list.txt
├── run_spider.py
└── user_agent_list.txt
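This layout is essentially what Scrapy’s project generator produces, plus our two list files. If you’re starting from scratch, you can scaffold it like this (note that startproject also creates a scrapy.cfg at the root, which I’ve omitted from the tree above):

scrapy startproject bot
cd bot
touch proxy_list.txt user_agent_list.txt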

Now, let’s dive into each file and see what it does.

1. bot/bot/settings.py

This file is the heart of the Scrapy project. It configures how the spider behaves, from handling proxies to obeying robots.txt rules.

# bot/settings.py

# Scrapy settings for the bot project
BOT_NAME = 'bot'

SPIDER_MODULES = ['bot.spiders']
NEWSPIDER_MODULE = 'bot.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure a delay for requests to the same website
DOWNLOAD_DELAY = 2

# Proxy list file
PROXY_LIST = 'proxy_list.txt'

# Enable the proxy and random user-agent middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# Retry many times, since proxies often fail
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# Enable and configure the AutoThrottle extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = False

# User-Agent list file
USER_AGENT_LIST = 'user_agent_list.txt'

Key points:

  • ROBOTSTXT_OBEY ensures that the scraper follows the rules specified in a website’s robots.txt file.
  • DOWNLOAD_DELAY adds a 2-second pause between requests to avoid overloading the server.
  • PROXY_LIST and USER_AGENT_LIST reference external files containing proxies and user-agent strings, ensuring that requests appear to come from different sources.
  • DOWNLOADER_MIDDLEWARES configures the middlewares responsible for handling proxies, retries, and random user agents; two of them come from third-party packages (install command below).
  • AUTOTHROTTLE manages the request rate dynamically, preventing the scraper from overwhelming the target server.
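Note that RandomProxy and RandomUserAgentMiddleware are not part of Scrapy itself: they come from the scrapy-proxies and scrapy-user-agents packages on PyPI, so install them alongside Scrapy before running the spider:

pip install scrapy scrapy-proxies scrapy-user-agents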

2. bot/user_agent_list.txt

This file contains a list of user-agent strings that the spider can rotate through. It helps in avoiding detection as a bot.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.1 Safari/605.1.15

By mimicking different browsers, we reduce the chance of getting blocked.

3. bot/proxy_list.txt

This file holds a list of proxies in the format http://proxy:port. The spider will rotate through these proxies to mask its IP address and avoid IP bans.

http://proxy1:port
http://proxy2:port
http://proxy3:port

Make sure to update this list with reliable proxies for better scraping results.
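If your proxies require authentication, scrapy-proxies also accepts credentials embedded in the proxy URL. A sketch (the username, password, and host below are placeholders; check your provider’s documentation for the exact scheme):

http://username:password@proxy1:port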

4. bot/bot/spiders/proxy_spider.py

Here’s where the actual scraping happens. The ProxySpider class defines a simple Scrapy spider.

import scrapy

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        self.log(f"Visited {response.url}")
        self.log(response.text[:100])  # Log the first 100 characters of the response

  • name: The spider’s name, which you’ll use when running it.
  • start_urls: The list of URLs where the spider will start scraping.
  • parse: The method that processes the response from the website. In this example, it logs the visited URL and the first 100 characters of the response content; a debugging variant that also logs the proxy and user agent in play follows below.
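When you first wire this up, it’s worth confirming that rotation actually happens. Here’s a hypothetical debugging variant (VerboseProxySpider is my name for it, not part of the project) that logs the proxy and User-Agent each request ended up using. The proxy middleware records its choice in request.meta['proxy'], and the User-Agent header reflects whatever the rotation middleware injected:

import scrapy

class VerboseProxySpider(scrapy.Spider):
    name = "verbose_proxy_spider"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # The proxy middleware stores the chosen proxy in request.meta['proxy']
        proxy = response.request.meta.get("proxy", "<no proxy>")
        # The User-Agent header is whatever the rotation middleware set
        ua = response.request.headers.get("User-Agent", b"<no UA>").decode()
        self.log(f"Visited {response.url} via {proxy} as {ua}")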

5. bot/run_spider.py

This script ties everything together and runs the spider.

# run_spider.py

from scrapy.crawler import CrawlerProcess

from bot.spiders.proxy_spider import ProxySpider

process = CrawlerProcess(settings={
    # Fallback User-Agent; the random user-agent middleware overrides it per request
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'RETRY_TIMES': 10,
    'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429],
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy_proxies.RandomProxy': 100,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    },
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 1,
    'AUTOTHROTTLE_MAX_DELAY': 60,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
    'AUTOTHROTTLE_DEBUG': False,
    'PROXY_LIST': 'proxy_list.txt',
    'USER_AGENT_LIST': 'user_agent_list.txt',
})

process.crawl(ProxySpider)
process.start()

  • CrawlerProcess: Initializes a Scrapy process with custom settings. Note that settings passed this way replace those in bot/settings.py rather than merging with them; see the alternative below.
  • process.crawl(ProxySpider): Starts crawling using the ProxySpider class.
  • process.start(): Kicks off the crawling process.
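One caveat: because we pass an explicit settings dict, bot/settings.py is never loaded here, which is why the dict duplicates it (and why ROBOTSTXT_OBEY and DOWNLOAD_DELAY from settings.py don’t apply to this run). A leaner sketch, assuming you run the script from the project root where startproject placed scrapy.cfg, loads the project settings instead:

# run_spider.py (alternative)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from bot.spiders.proxy_spider import ProxySpider

# Reads bot/settings.py via scrapy.cfg, so all settings live in one place
process = CrawlerProcess(settings=get_project_settings())
process.crawl(ProxySpider)
process.start()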

Conclusion

And that’s it! You’ve just built a web scraper using Scrapy, with proxy and user-agent rotation to avoid detection. This setup is powerful and can be extended to scrape data from various websites while staying under the radar.

Feel free to customize the spider, add more features, or scrape different websites. If you run into any challenges or have questions, drop a comment below. I’d love to help you out!

Ready to Level Up Your Python Skills?

“EscapeMantra: The Ultimate Python Ebook” is here to guide you through every step of mastering Python. Whether you’re new to coding or looking to sharpen your skills, this ebook is packed with practical examples, hands-on exercises, and real-world projects to make learning both effective and enjoyable.

Here’s what you’ll get:

  • Clear Explanations: Understand Python concepts easily with straightforward guidance.
  • Engaging Projects: Work on fun projects like a Snake game and an AI Chatbot to apply what you’ve learned.
  • Hands-On Practice: Build your skills with exercises designed to boost your confidence.

👉 Grab your copy. Dive in today and start mastering Python at your own pace. Don’t wait — your programming journey starts now!

🚀 Support My Work and Get More Exclusive Content! 🚀

If you found this article helpful and want to see more in-depth content, tools, and exclusive resources, consider supporting me on Patreon. Your support helps me create and share valuable content, improve projects, and build a community of passionate developers.

👉 Become a Patron Today! Join here to access exclusive source codes, early project releases, and more!

Thank you for your support and for being part of this journey!
