Mastering Crawlee and Cheerio: Complete Guide 2025
Introduction

In the ever-evolving digital landscape, access to data has become essential for businesses and researchers alike. Web scraping provides a powerful method for extracting valuable information from websites, enabling market research, competitive analysis, price monitoring, and content aggregation at scale.
Crawlee has emerged as one of the most powerful and flexible open-source web scraping libraries for JavaScript and Node.js. When combined with Cheerio, a fast and lightweight implementation of jQuery for server-side HTML parsing, developers gain a robust toolkit for building efficient, scalable web scrapers that can handle everything from simple static websites to complex, JavaScript-heavy applications.
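To see what Cheerio contributes on its own, here is a minimal sketch (the HTML snippet is made up for illustration) that loads markup and queries it with jQuery-style selectors, no browser required:
import * as cheerio from 'cheerio';

const html = '<ul><li class="item">First</li><li class="item">Second</li></ul>';
const $ = cheerio.load(html);

// jQuery-style selection and traversal on the server
$('.item').each((i, el) => {
    console.log($(el).text()); // "First", then "Second"
});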
Key Features
Multiple Crawler Types
CheerioCrawler for static content, PlaywrightCrawler for dynamic JavaScript-rendered pages, and more.
Automatic Request Queueing
Efficiently manages request queues, with deduplication and priority handling built in (illustrated below).
Session Management
Sophisticated session rotation and management to mimic real users and avoid detection.
Proxy Integration
Built-in support for proxy rotation and retry mechanisms to handle blocking and rate limiting.
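As a quick illustration of the request queueing above: URLs are deduplicated by a normalized unique key, so a repeated start URL is fetched only once. A minimal sketch:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(`Fetched ${request.url}`);
    },
});

// The duplicate of page-1 is dropped by the queue; only two requests run
await crawler.run([
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://example.com/page-1',
]);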
Installation and Setup
Getting started with Crawlee is straightforward. Follow these steps to install and set up the library in your Node.js development environment.
# Create a new project
mkdir my-crawler
cd my-crawler
npm init -y

# Install Crawlee and its dependencies
npm install crawlee cheerio

# If you need browser automation
npm install playwright
Crawlee offers several crawler types, each tailored to different scraping needs:
- CheerioCrawler: Perfect for scraping static HTML content, using Cheerio for parsing.
- PlaywrightCrawler: Ideal for sites with dynamic JavaScript content that requires a browser to render.
- PuppeteerCrawler: An alternative browser-based crawler using Puppeteer instead of Playwright.
- JSDOMCrawler: Uses JSDOM for parsing HTML, offering a middle ground between Cheerio and browser-based crawlers (see the sketch below).
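The crawler types share one API shape, so switching between them is mostly a matter of what the request handler receives. A minimal JSDOMCrawler sketch, which gets a DOM window rather than a Cheerio handle:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    requestHandler: async ({ request, window }) => {
        // Standard DOM APIs instead of Cheerio's jQuery-style API
        console.log(`Title of ${request.url}: ${window.document.title}`);
    },
});

await crawler.run(['https://example.com']);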
Crawlee with Cheerio: Building Web Scrapers
Crawlee is a powerful web scraping and browser automation library that makes it easy to build scrapers for any website. When combined with Cheerio, it provides an efficient way to parse and extract data from HTML. Here's how to get started:
Basic Crawlee Setup with Cheerio
Let's start with a basic example of using CheerioCrawler to scrape a website's titles. This approach is perfect for lightweight scraping tasks where you don't need to render JavaScript.
import { CheerioCrawler } from 'crawlee';
// Create a basic crawler
const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
    },
});
// Start crawling
await crawler.run(['https://example.com']);
Advanced Crawlee Example: Product Scraping
Now let's look at a more complex example that scrapes product data from an e-commerce site. This example shows how to extract structured data and save it to a dataset.
import { CheerioCrawler, Dataset } from 'crawlee';
// Create a dataset to store the results
const dataset = await Dataset.open('product-data');
const crawler = new CheerioCrawler({
    // Run at most 10 requests in parallel
    maxConcurrency: 10,
    // Function called for each page
    requestHandler: async ({ request, $ }) => {
        // Extract data from the page
        const products = $('.product-item').map((i, el) => {
            const $el = $(el);
            return {
                title: $el.find('.product-title').text().trim(),
                price: $el.find('.product-price').text().trim(),
                imageUrl: $el.find('img').attr('src'),
                url: $el.find('a').attr('href'),
                id: $el.attr('data-product-id'),
            };
        }).get();
        // Save the data to the dataset
        await dataset.pushData({
            url: request.url,
            products,
            extractedAt: new Date().toISOString(),
        });
        // Find links to other product pages and resolve them against
        // the current URL, since pagination hrefs are often relative
        const nextPageLinks = $('.pagination a')
            .map((i, el) => $(el).attr('href'))
            .get()
            .map((href) => new URL(href, request.url).href);
        // Add discovered links to the queue (duplicates are skipped automatically)
        await crawler.addRequests(nextPageLinks);
    },
});
// Start the crawler
await crawler.run(['https://example-eshop.com/products']);
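The manual pagination handling above works, but Crawlee's crawling context also provides an enqueueLinks helper that resolves relative URLs and deduplicates for you, so the last two steps collapse into a single call:
requestHandler: async ({ request, $, enqueueLinks }) => {
    // ... extract and save products as above ...
    // Enqueue every link matched by the selector, resolved against the page URL
    await enqueueLinks({ selector: '.pagination a' });
},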
Using Proxies with Crawlee
When scraping at scale, you often need to use proxies to avoid IP bans. Here's how to configure Crawlee to work with proxies:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
// Set up proxy configuration
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://username:password@proxy1.coronium.io:8000',
        'http://username:password@proxy2.coronium.io:8000',
        'http://username:password@proxy3.coronium.io:8000',
    ],
});
const crawler = new CheerioCrawler({
    // Use the proxy configuration
    proxyConfiguration,
    // Additional options to handle proxy rotation
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 5, // Rotate sessions after 5 uses
        },
    },
    requestHandler: async ({ request, $, proxyInfo }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
        // proxyInfo, provided on the crawling context, describes
        // the proxy Crawlee used for this request
        if (proxyInfo) {
            console.log(`Used proxy: ${proxyInfo.url}`);
        }
    },
});
await crawler.run(['https://example.com']);
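If a fixed list of proxy URLs is too rigid, ProxyConfiguration also accepts a newUrlFunction that returns a proxy URL per session. A minimal sketch; the gateway address and session query parameter are hypothetical, so substitute your provider's actual scheme:
const proxyConfiguration = new ProxyConfiguration({
    // Called whenever Crawlee needs a proxy URL for a session
    // (hypothetical gateway URL; adjust to your provider)
    newUrlFunction: (sessionId) =>
        `http://username:password@gateway.example.com:8000?session=${sessionId}`,
});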
Handling AJAX and JavaScript Content
While CheerioCrawler is great for static content, many modern websites rely on JavaScript to load their content. For these cases, you can use PlaywrightCrawler:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    // Playwright provides a headless browser for JavaScript-rendered content
    headless: true,
    requestHandler: async ({ page, request }) => {
        // Wait for content to load
        await page.waitForSelector('.dynamic-content');
        // Extract data after JavaScript executes
        const title = await page.title();
        const price = await page.$eval('.price', (el) => el.textContent);
        const description = await page.$eval('.description', (el) => el.textContent);
        // Save the data
        await Dataset.pushData({
            url: request.url,
            title,
            price,
            description,
        });
    },
});
await crawler.run(['https://javascript-heavy-site.com/products']);
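If no single selector reliably signals that the AJAX content has finished loading, one option is to wait for network activity to settle before extracting. A minimal sketch; '.product-card' is an assumed selector:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait until the page has gone ~500 ms without network traffic
        await page.waitForLoadState('networkidle');
        const names = await page.$$eval('.product-card', (els) =>
            els.map((el) => el.textContent.trim()));
        console.log(`${request.url}: found ${names.length} products`);
    },
});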
These examples demonstrate the flexibility and power of Crawlee for various web scraping scenarios, from simple static sites to complex JavaScript applications.
Why Crawlee?
Crawlee offers significant advantages over traditional web scraping libraries, making it the tool of choice for modern web data extraction needs. Here's why developers choose Crawlee:
- Performance: Optimized for speed and efficiency, handling concurrent requests intelligently to maximize throughput while respecting website limitations.
- Scalability: Designed to scale from small projects to enterprise-level data extraction operations, handling millions of pages without performance degradation.
- Reliability: Built-in retry mechanisms, proxy rotation, and error handling ensure your scraper continues working even when facing temporary issues.
- Developer Experience: Clean API design, comprehensive documentation, and active community support make development faster and more enjoyable.
- Flexibility: Can be adapted to virtually any scraping scenario, from simple data collection to complex workflows with multiple stages.
By addressing common challenges in web scraping like blocking, rate limiting, and handling various content types, Crawlee provides a robust foundation for building reliable data extraction pipelines.
Use Cases and Examples
E-commerce Price Monitoring
Track competitor prices across multiple websites in near real time to inform your pricing strategy and stay competitive.
Content Aggregation
Build news aggregators, RSS readers, or specialized content platforms by extracting articles and media from multiple sources.
Lead Generation
Extract business contact information from directories, social media platforms, and company websites to build targeted lead databases.
SEO Monitoring
Track search rankings, analyze competitor content, and monitor backlinks to improve your website's search engine visibility.
// Example: Building a simple news aggregator
import { CheerioCrawler, Dataset } from 'crawlee';
// Define the news sites to scrape
const startUrls = [
'https://news-site1.com',
'https://news-site2.com',
'https://news-site3.com'
];
const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100, // Limit the total number of requests
    // Define how to process each page
    requestHandler: async ({ request, $, log }) => {
        log.info(`Processing ${request.url}...`);
        // Extract articles using site-specific selectors
        const articles = $('.article-card').map((i, el) => {
            const $article = $(el);
            return {
                title: $article.find('h2').text().trim(),
                summary: $article.find('.summary').text().trim(),
                url: new URL($article.find('a').attr('href'), request.url).href,
                source: new URL(request.url).hostname,
                publishedAt: $article.find('.date').text().trim(),
                scrapedAt: new Date().toISOString(),
            };
        }).get();
        // Save the data
        await Dataset.pushData({
            url: request.url,
            articles,
        });
        // Follow pagination links
        const nextPageUrl = $('.pagination .next').attr('href');
        if (nextPageUrl) {
            await crawler.addRequests([new URL(nextPageUrl, request.url).href]);
        }
    },
});
// Start the crawl
await crawler.run(startUrls);
// Export the results
await Dataset.exportToJSON('news-articles');
Best Practices for Crawlee and Cheerio
To build effective and ethical web scrapers with Crawlee and Cheerio, follow these best practices:
Performance Optimization
- Use appropriate concurrency settings based on the target website's capacity.
- Implement proper caching strategies to avoid redundant requests.
- Choose the right crawler type for your use case: CheerioCrawler for static content, PlaywrightCrawler for JavaScript-heavy sites.
- Batch requests and write results to datasets incrementally rather than holding everything in memory during large-scale scrapes.
Ethical Scraping
- Always respect robots.txt directives and website terms of service (see the sketch after this list).
- Implement reasonable rate limiting to avoid overloading servers.
- Include proper user agent identification in your requests.
- Consider using the site's official API if one is available.
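For robots.txt specifically, here is a minimal sketch using the third-party robots-parser package (npm install robots-parser) to check a URL before crawling it; the bot name mirrors the User-Agent used later in this section:
import robotsParser from 'robots-parser';

// Fetch and parse the site's robots.txt (Node 18+ provides global fetch)
const robotsUrl = 'https://example.com/robots.txt';
const response = await fetch(robotsUrl);
const robots = robotsParser(robotsUrl, await response.text());

// Only crawl URLs the site permits for our user agent
const target = 'https://example.com/products';
if (robots.isAllowed(target, 'MyCompany-DataResearch-Bot/1.0')) {
    await crawler.run([target]); // crawler defined as in the examples above
}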
Error Handling and Resilience
- Implement comprehensive error handling for network issues, parsing errors, and timeouts.
- Use retries with exponential backoff for transient failures.
- Regularly monitor your scrapers to ensure they continue functioning as websites evolve.
- Implement alerts for unusual error rates or changes in data structure.
// Example of implementing best practices
import { CheerioCrawler, Dataset, ProxyConfiguration } from 'crawlee';
// Ethical rate limiting and proxy rotation
const crawler = new CheerioCrawler({
    // Respect server capacity
    maxConcurrency: 5,
    maxRequestsPerMinute: 30,
    // Use proxies to distribute load
    proxyConfiguration: new ProxyConfiguration({
        proxyUrls: ['http://proxy1.example.com', 'http://proxy2.example.com'],
    }),
    // Rotate sessions to spread requests across identities
    useSessionPool: true,
    // Identify your scraper properly by setting a descriptive
    // User-Agent header on every outgoing request
    preNavigationHooks: [
        (crawlingContext, gotOptions) => {
            gotOptions.headers = {
                ...gotOptions.headers,
                'User-Agent': 'MyCompany-DataResearch-Bot/1.0 (research@example.com)',
            };
        },
    ],
    // Comprehensive error handling
    requestHandlerTimeoutSecs: 60,
    navigationTimeoutSecs: 120,
    retryOnBlocked: true,
    maxRequestRetries: 3,
    requestHandler: async ({ request, $ }) => {
        try {
            // Your scraping logic here
        } catch (error) {
            // Log and handle specific errors appropriately
            console.error(`Error processing ${request.url}: ${error.message}`);
            // Save failed URLs for later inspection
            await Dataset.pushData({
                url: request.url,
                error: error.message,
                timestamp: new Date().toISOString(),
            });
        }
    },
});
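Instead of a try/catch inside the handler, you can also centralize failure handling with Crawlee's failedRequestHandler, which runs only after a request has exhausted all of its retries (a minimal sketch, reusing the Dataset import above):
const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    requestHandler: async ({ request, $ }) => {
        // Normal scraping logic; a thrown error triggers an automatic retry
    },
    // Invoked once all retries for a request have failed
    failedRequestHandler: async ({ request }, error) => {
        await Dataset.pushData({
            url: request.url,
            error: error.message,
            failedAt: new Date().toISOString(),
        });
    },
});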
Conclusion
Crawlee and Cheerio represent a powerful combination for modern web scraping challenges. By leveraging Crawlee's sophisticated request handling, queueing, and resilience features alongside Cheerio's fast and intuitive HTML parsing, developers can build robust, scalable web scrapers that efficiently extract valuable data from the web.
Whether you're conducting price monitoring, content aggregation, lead generation, or SEO analysis, this toolkit provides the flexibility and performance needed to tackle projects of any scale. By following the best practices outlined in this guide and leveraging the code examples provided, you'll be well-equipped to build ethical, efficient web scrapers that deliver reliable results.
Enhance Your Web Scraping Strategy Today
Whether you're a data scientist, market researcher, or developer, Crawlee and Cheerio provide the tools you need to perform reliable and efficient web scraping. Add our high-quality proxies to your toolkit to prevent blocking and improve your scraping success rates.
About the Author
Coronium.io Organization
Coronium.io is a leading provider of advanced networking solutions, specializing in proxy services and VPN technologies. Committed to innovation and user satisfaction, Coronium.io offers tools that enhance online privacy, security, and performance for individuals and businesses alike.
Disclaimer
Our 4G mobile proxies are intended for legal and legitimate use only. This page is solely for informational and marketing purposes. It is the user's responsibility to ensure compliance with the terms of service of the platforms they are using our proxies on. We do not condone or support the use of our proxies for illegal or unauthorized activities. By using our proxies, you agree to use them in accordance with all applicable laws and regulations. We will not be held liable for any misuse of our proxies. Please read our Terms of Service before using our services.