Parsing with Crawlee: Advanced Web Scraping Guide
The comprehensive guide to advanced web parsing with Crawlee: from setup and crawler types to handling dynamic sites, CAPTCHAs, and scalable architecture. Learn how 4G mobile proxies ensure reliable data extraction, even from the most challenging websites.
Understanding Web Parsing and Crawlee
Web parsing (or scraping) is the process of extracting structured data from websites. While simple in concept, modern parsing faces challenges from dynamic content, anti-bot measures, and complex website architectures.
Crawling
Crawling is the process of navigating through websites by following links, loading page content, and discovering new URLs to process. Think of it as mapping the website's structure.
- Systematically explores website structure
- Manages navigation and page discovery
- Fetches raw HTML/content for parsing
Parsing
Parsing is the extraction of specific data from web pages once they've been loaded. It involves identifying and retrieving structured information from the raw content.
- Extracts targeted information from pages
- Transforms unstructured HTML into structured data
- Uses selectors to identify and extract data points
What is Crawlee?
Crawlee is a powerful JavaScript library for web scraping and browser automation that combines both crawling and parsing capabilities. It's the successor to Apify SDK and represents the modern approach to web data extraction with built-in solutions for the most common scraping challenges.
Key Advantages of Crawlee
- Multiple crawler types: choose the right tool for each website's complexity.
- Smart request queue: automatic deduplication and prioritization of URLs.
- Built-in storage: dataset management for structured results.
- Proxy integration: seamless IP rotation and session management.
- Error handling: automatic retries for failed requests and timeouts.
- Autoscaling: optimize performance while avoiding detection.
Getting Started with Crawlee
Installation and Setup
1. Install Node.js
# Download from https://nodejs.org/ (LTS version recommended)
2. Create a new project
mkdir crawlee-project
cd crawlee-project
npm init -y
3. Install Crawlee
npm install crawlee
For TypeScript support, add TypeScript and related dependencies:
npm install typescript ts-node @types/node --save-dev
4. Create a basic crawler script
Create a file named main.js (or main.ts for TypeScript).
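A minimal main.js might look like the sketch below, with https://example.com standing in as a placeholder for your target site. Run it with node main.js after adding "type": "module" to package.json (or rename the file to main.mjs) so the top-level await works.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Log the page title for each visited URL
        console.log(`${request.url} - ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);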
Choosing the Right Crawler Type
Crawlee offers multiple crawler implementations, each designed for specific use cases. Selecting the right crawler type is crucial for efficient and effective data extraction.
CheerioCrawler
Fast HTML-parser-based crawler for static content
Best For: Static websites where the content is already present in the initial HTML response
Pros:
- ✓ Extremely fast and lightweight
- ✓ Low memory footprint
- ✓ Perfect for simple HTML sites
Cons:
- ✗ Cannot handle JavaScript-rendered content
- ✗ No browser automation capabilities
PlaywrightCrawler
Modern multi-browser automation supporting Chromium, Firefox, and WebKit
Best For: Modern websites with dynamic content, JavaScript rendering, or complex interactions such as logins and form submissions
Pros:
- ✓ Modern and actively maintained
- ✓ Supports multiple browsers
- ✓ Excellent for complex sites
- ✓ Powerful automation API
Cons:
- ✗ Heavier resource usage
- ✗ Slower than Cheerio for static content
PuppeteerCrawler
Chrome-based crawler for JavaScript-heavy websites
Best For: Chrome-specific features and projects built on existing Puppeteer code
Pros:
- ✓ Direct Chrome integration
- ✓ Good community support
- ✓ Compatible with Chrome extensions
Cons:
- ✗ Limited to Chromium-based browsers
- ✗ Being replaced by Playwright in many projects
JSDOMCrawler
Lightweight browser-like environment for JavaScript execution
Best For: Pages with light JavaScript that need script execution but not a full browser
Pros:
- ✓ Faster than full browser automation
- ✓ Good middle ground between Cheerio and Playwright
- ✓ Lower resource usage
Cons:
- ✗ Limited JavaScript support compared to real browsers
- ✗ Cannot handle complex rendering
Which Crawler Should You Choose?
- Choose CheerioCrawler for static websites that don't rely on JavaScript for rendering content. It's the fastest and most lightweight option.
- Choose PlaywrightCrawler for modern websites with dynamic content, JavaScript rendering, or when you need to automate complex interactions like form submissions or logins.
- Choose PuppeteerCrawler for Chrome-specific features or when working with existing Puppeteer code.
- Choose JSDOMCrawler for sites with light JavaScript that don't require a full browser but need more than just HTML parsing.
Practical Code Examples
Basic Scraping with CheerioCrawler
Perfect for static websites without complex JavaScript rendering
import { CheerioCrawler } from 'crawlee';

// Initialize the CheerioCrawler
const crawler = new CheerioCrawler({
    // This function will be called for each URL
    requestHandler: async ({ request, $, enqueueLinks }) => {
        // Extract data from the page using Cheerio selectors
        const title = $('title').text();
        const h1Text = $('h1').text();
        const productPrices = [];

        // Extract all product prices
        $('.product-price').each((index, el) => {
            productPrices.push($(el).text().trim());
        });

        // Log the results
        console.log(`URL: ${request.url}`);
        console.log(`Title: ${title}`);
        console.log(`H1: ${h1Text}`);
        console.log('Product prices:', productPrices);

        // Optionally, enqueue links for crawling
        await enqueueLinks({
            selector: '.pagination a',
            baseUrl: request.loadedUrl,
        });
    },

    // Other options...
    maxRequestsPerCrawl: 10, // Limit the crawler to 10 requests
    maxConcurrency: 5, // Maximum concurrent requests
});

// Start the crawler with a list of URLs
await crawler.run(['https://example.com/products']);
Key points:
- CheerioCrawler uses Cheerio under the hood, which has a jQuery-like API
- The requestHandler function is called for each page
- Use CSS selectors to extract data from the page
- The enqueueLinks function allows you to add more URLs to the crawl queue
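When the same page renders its content with JavaScript, the structure carries over almost unchanged to PlaywrightCrawler. The sketch below reuses the placeholder selectors from the Cheerio example (.product-price, .pagination a); adjust them to your target site.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        // Wait until the JavaScript-rendered prices are in the DOM
        await page.waitForSelector('.product-price');

        const title = await page.title();
        const productPrices = await page.$$eval('.product-price', (els) =>
            els.map((el) => el.textContent.trim()),
        );

        console.log(`URL: ${request.url}`);
        console.log(`Title: ${title}`);
        console.log('Product prices:', productPrices);

        // Follow pagination links, same as in the Cheerio example
        await enqueueLinks({ selector: '.pagination a' });
    },
    headless: true,
    maxRequestsPerCrawl: 10,
});

await crawler.run(['https://example.com/products']);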
Common Challenges and Solutions
Modern web scraping faces numerous challenges as websites implement increasingly sophisticated anti-bot measures. Here's how to overcome the most common obstacles using Crawlee and 4G mobile proxies.
Anti-Bot Protection
Challenge:
Websites implement sophisticated anti-bot measures to detect and block automated scraping.
Solution:
Use browser fingerprint randomization, realistic mouse movements and delays, and high-quality 4G mobile proxies that have real-user trust scores.
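As a rough sketch, the handler below adds humanlike pauses and mouse movement and routes traffic through a 4G mobile proxy; the proxy URL is a placeholder for your own credentials. Recent Crawlee versions can also randomize browser fingerprints for their browser-based crawlers.
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder 4G mobile proxy endpoint - substitute your own credentials
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@mobile-proxy.example:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page }) => {
        // Mimic a human: move the mouse and pause for a random interval
        await page.mouse.move(100 + Math.random() * 200, 100 + Math.random() * 200);
        await page.waitForTimeout(1000 + Math.random() * 2000);
        // ...extract data here
    },
});

await crawler.run(['https://example.com']);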
Dynamic Content Loading
Challenge:
Content loads via JavaScript or lazy-loading techniques, making it invisible to basic scrapers.
Solution:
Use PlaywrightCrawler or PuppeteerCrawler with proper waiting strategies (waitForSelector, waitForFunction) to ensure content is fully loaded before extraction.
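For example, a PlaywrightCrawler handler can block until a known element or an arbitrary condition is satisfied before extracting anything. The selectors below (.product-list, .product-item) are placeholders.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait for an element that only appears once the content has rendered
        await page.waitForSelector('.product-list', { timeout: 30_000 });

        // Or wait for an arbitrary condition, e.g. lazy-loaded items have appeared
        await page.waitForFunction(
            () => document.querySelectorAll('.product-item').length > 0,
        );

        const items = await page.$$eval('.product-item', (els) =>
            els.map((el) => el.textContent.trim()),
        );
        console.log(request.url, items.length, 'items');
    },
});

await crawler.run(['https://example.com/catalog']);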
CAPTCHA Barriers
Challenge:
CAPTCHAs and verification challenges block automated access to websites.
Solution:
Use 4G mobile proxies that have high trust scores, implement session persistence, and minimize request patterns that trigger CAPTCHA systems.
IP Blocking and Rate Limiting
Challenge:
Sites block IPs that make too many requests or follow suspicious patterns.
Solution:
Implement proxy rotation with high-quality 4G mobile proxies, introduce random delays, and limit concurrent requests per domain.
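A sketch of what that can look like in Crawlee, with placeholder proxy URLs standing in for your 4G mobile proxy endpoints:
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Rotate across several 4G mobile proxies (placeholder URLs)
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@mobile-proxy-1.example:8000',
        'http://user:pass@mobile-proxy-2.example:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxConcurrency: 2,          // keep parallelism low per target
    maxRequestsPerMinute: 30,   // throttle the overall request rate
    maxRequestRetries: 3,       // retry failed requests automatically
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com']);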
Changing Website Structure
Challenge:
Website markup changes frequently, breaking selectors and extraction patterns.
Solution:
Create robust selectors, implement resilient data extraction with multiple fallback patterns, and set up monitoring for scraper health.
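One simple way to do this is a small helper that walks a list of candidate selectors and returns the first non-empty match; the selectors shown are hypothetical:
// Hypothetical helper: try several selectors until one yields a value
function extractFirst($, selectors) {
    for (const selector of selectors) {
        const value = $(selector).first().text().trim();
        if (value) return value;
    }
    return null;
}

// Usage inside a CheerioCrawler requestHandler:
// const price = extractFirst($, ['.price-now', '[data-testid="price"]', '.product-price']);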
Session Management
Challenge:
Websites use cookies and session data to track users and detect automation.
Solution:
Preserve and manage cookies between requests, maintain session state, and implement login flows where necessary.
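Crawlee's session pool handles much of this for you. A sketch with a browser crawler that keeps cookies per session and retires a session when it appears blocked (the .captcha-challenge selector is a placeholder):
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Reuse a pool of sessions and keep each session's cookies between requests
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 10, // a small pool keeps each identity consistent
    },
    requestHandler: async ({ page, session }) => {
        // If the site blocks us, retire the session so a fresh one is used next time
        const blocked = await page.$('.captcha-challenge'); // placeholder selector
        if (blocked) session.retire();
        // ...normal extraction logic
    },
});

await crawler.run(['https://example.com']);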
The 4G Mobile Proxy Advantage
4G mobile proxies provide unique advantages for web scraping that are difficult to match with other proxy types:
Genuine Residential IPs
4G proxies use IPs from actual mobile carriers, making them indistinguishable from real mobile users: they are real mobile connections.
High Trust Scores
Mobile IPs typically have excellent trust scores with minimal abuse history, reducing the likelihood of being blocked.
Geographic Targeting
Access country-specific IPs to scrape localized content and overcome geo-restrictions legitimately.
Stable Connections
Maintain consistent connections for extended scraping sessions with minimal disconnections.
Advanced Scraping Techniques
Handling Login-Protected Content
Access authenticated areas with session management
Many valuable data sources require authentication. Here's how to handle login-protected content with Crawlee:
- Create a session-handling crawler with PlaywrightCrawler
- Perform login once at the beginning of your scraping session
- Store cookies and session data to maintain authentication
- Use a dedicated proxy to keep a consistent IP address throughout the session
Pro tip: With 4G mobile proxies, your session is less likely to be flagged as suspicious compared to datacenter IPs, as they appear as genuine residential users.
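A rough sketch of that flow with PlaywrightCrawler: one labelled request performs the login, and later requests reuse the same session and its cookies. All selectors, URLs, and environment variables are placeholders for your target site.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: { maxPoolSize: 1 }, // keep a single identity for the whole run
    maxConcurrency: 1, // process the login request before the data requests
    requestHandler: async ({ page, request }) => {
        if (request.label === 'LOGIN') {
            // Placeholder selectors and credentials - adapt to your site
            await page.fill('#username', process.env.SITE_USER ?? '');
            await page.fill('#password', process.env.SITE_PASS ?? '');
            await page.click('button[type="submit"]');
            await page.waitForSelector('.account-dashboard'); // wait until logged in
        } else {
            // Subsequent requests reuse the stored cookies from the same session
            const data = await page.$$eval('.member-only-item', (els) =>
                els.map((el) => el.textContent.trim()),
            );
            console.log(request.url, data);
        }
    },
});

await crawler.run([
    { url: 'https://example.com/login', label: 'LOGIN' },
    { url: 'https://example.com/members/data' },
]);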
Extracting Data from Interactive Elements
Handle tabs, accordion menus, and other dynamic content
Modern websites often hide content behind interactive elements. Here's how to access it:
- Use Playwright's interaction API to click on elements, hover, and interact
- Implement proper waiting strategies using waitForSelector or waitForFunction
- Create custom navigation functions for complex multi-step processes
- Handle lazy-loading content by scrolling and monitoring DOM changes
Pro tip: When interacting with a website, add random delays between actions to mimic natural human behavior and reduce the chance of being detected as a bot.
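For instance, a handler might open a tab, then scroll in steps with randomized pauses to trigger lazy loading; the selectors below are placeholders:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Open a tab before reading its content (placeholder selectors)
        await page.click('.tab-specifications');
        await page.waitForSelector('.tab-specifications-content', { state: 'visible' });

        // Scroll down in steps to trigger lazy-loaded items
        for (let i = 0; i < 5; i++) {
            await page.mouse.wheel(0, 1000);
            // Random pause between scrolls to look less robotic
            await page.waitForTimeout(500 + Math.random() * 1000);
        }

        const specs = await page.$$eval('.spec-row', (els) =>
            els.map((el) => el.textContent.trim()),
        );
        console.log(specs);
    },
});

await crawler.run(['https://example.com/product/123']);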
Building Resilient Scrapers
Design scrapers that handle errors and site changes gracefully
Websites change frequently, and a robust scraper should adapt to these changes:
- Implement multiple selector strategies with fallbacks for critical data
- Use regular expressions or fuzzy matching for more flexible data extraction
- Log detailed information about failures to quickly identify issues
- Set up monitoring to alert you when success rates drop below thresholds
- Implement automatic retries with different proxies on failure
Pro tip: Use 4G mobile proxies from different carriers and regions to increase your scraping reliability. If one carrier is blocked, others may still work perfectly.
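A minimal sketch of the retry-and-monitor part: throw from the handler when the page looks wrong so Crawlee retries the request, and use failedRequestHandler to log requests that exhausted their retries.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5, // retries typically run on a fresh session (and proxy, if configured)
    requestHandler: async ({ page, request, log }) => {
        const title = await page.title();
        if (!title) {
            // Throwing marks the request as failed so Crawlee retries it
            throw new Error('Empty page - possibly blocked');
        }
        log.info(`Scraped ${request.url}`);
    },
    failedRequestHandler: async ({ request, log }) => {
        // Called after all retries are exhausted - hook this into monitoring/alerting
        log.error(`Request ${request.url} failed too many times`);
    },
});

await crawler.run(['https://example.com']);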
Ready to Build Reliable Web Scrapers?
Combine the power of Crawlee with Coronium's 4G mobile proxies for unparalleled scraping reliability. Get access to clean residential IPs with high trust scores that make your scrapers virtually undetectable.
Related Resources
Web Parsing with 4G Proxies
Learn how 4G proxies enhance web scraping reliability and bypass common anti-scraping measures.
Residential Proxies vs Regular VPNs
Understand the key differences and when to use each solution for your data collection needs.
Common Proxy Error Codes
Troubleshoot common issues when using proxies for web scraping projects.