
Parsing with Crawlee: Advanced Web Scraping Guide

Web Scraping
JavaScript
Data Extraction

The comprehensive guide to advanced web parsing with Crawlee: from setup and crawler types to handling dynamic sites, CAPTCHAs, and scalable architecture. Learn how 4G mobile proxies ensure reliable data extraction, even from the most challenging websites.

EXPERT GUIDE
2025 EDITION

Why Crawlee is revolutionizing web scraping:

Unified API: Single interface for all crawler types (Cheerio, Playwright, Puppeteer, JSDOM).
Smart Queue: Automatic request deduplication, retries, and prioritization for efficient crawling.
Anti-blocking: Built-in proxy support and session management to avoid detection and blocks.
Scalability: Designed to handle millions of pages with automatic storage and concurrency management.
Key Specifications

Latest Version: 3.x
Language: JavaScript/TypeScript
Dependencies: Node.js 16+

Understanding Web Parsing and Crawlee

Web parsing (or scraping) is the process of extracting structured data from websites. While simple in concept, modern parsing faces challenges from dynamic content, anti-bot measures, and complex website architectures.

Crawling

Crawling is the process of navigating through websites by following links, loading page content, and discovering new URLs to process. Think of it as mapping the website's structure.

  • Systematically explores website structure
  • Manages navigation and page discovery
  • Fetches raw HTML/content for parsing

Parsing

Parsing is the extraction of specific data from web pages once they've been loaded. It involves identifying and retrieving structured information from the raw content.

  • Extracts targeted information from pages
  • Transforms unstructured HTML into structured data
  • Uses selectors to identify and extract data points

What is Crawlee?

Crawlee is a powerful JavaScript library for web scraping and browser automation that combines both crawling and parsing capabilities. It's the successor to Apify SDK and represents the modern approach to web data extraction with built-in solutions for the most common scraping challenges.

Key Advantages of Crawlee

Multiple crawler types

Choose the right tool for each website's complexity.

Smart request queue

Automatic deduplication and prioritization.

Automatic data storage

Built-in dataset management for structured results.

Proxy integration

Seamless IP rotation and session management.

Retry mechanisms

Automatic handling of failed requests and timeouts.

Concurrency control

Optimize performance while avoiding detection.

Getting Started with Crawlee

Installation and Setup

1. Install Node.js

# Download from https://nodejs.org/ (LTS version recommended)

2. Create a new project

mkdir crawlee-project
cd crawlee-project
npm init -y

3. Install Crawlee

npm install crawlee

For TypeScript support, add TypeScript and related dependencies:

npm install typescript ts-node @types/node --save-dev

4. Create a basic crawler script

Create a file named main.js (or main.ts for TypeScript)
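
A minimal main.js might look like the following sketch, assuming the project uses ES modules (add "type": "module" to package.json so top-level await works). The start URL is just a placeholder:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  // Called for every page the crawler fetches successfully
  requestHandler: async ({ request, $ }) => {
    console.log(`Fetched ${request.url}: ${$('title').text()}`);
  },
});

// https://example.com is a placeholder start URL
await crawler.run(['https://example.com']);

Run it with node main.js. The sections below cover how to pick the right crawler type and extract real data.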

Choosing the Right Crawler Type

Crawlee offers multiple crawler implementations, each designed for specific use cases. Selecting the right crawler type is crucial for efficient and effective data extraction.

CheerioCrawler

Fast, HTML-parser-based crawler for static content

Best For:

Static websites
Simple HTML parsing
High-volume scraping
SEO analysis

Pros:

  • ✓ Extremely fast and lightweight
  • ✓ Low memory footprint
  • ✓ Perfect for simple HTML sites

Cons:

  • ✕ Cannot handle JavaScript-rendered content
  • ✕ No browser automation capabilities

PlaywrightCrawler

Modern multi-browser automation supporting Chromium, Firefox, and WebKit

Best For:

Dynamic JS websites
Single-page applications
Sites with complex interactions
E-commerce stores

Pros:

  • ✓ Modern and actively maintained
  • ✓ Supports multiple browsers
  • ✓ Excellent for complex sites
  • ✓ Powerful automation API

Cons:

  • ✕ Heavier resource usage
  • ✕ Slower than Cheerio for static content

PuppeteerCrawler

Chrome-based crawler for JavaScript-heavy websites

Best For:

Chrome-specific scraping
Legacy projects
Dynamic content

Pros:

  • ✓ Direct Chrome integration
  • ✓ Good community support
  • ✓ Compatible with Chrome extensions

Cons:

  • ✕ Limited to Chromium browsers
  • ✕ Being replaced by Playwright in many projects

JSDOMCrawler

Lightweight browser-like environment for JavaScript execution

Best For:

Light JS processing
When browser features aren't needed
Memory-constrained environments

Pros:

  • ✓ Faster than full browser automation
  • ✓ Good middle ground between Cheerio and Playwright
  • ✓ Lower resource usage

Cons:

  • ✕ Limited JavaScript support compared to real browsers
  • ✕ Cannot handle complex rendering

Which Crawler Should You Choose?

  • Choose CheerioCrawler for static websites that don't rely on JavaScript for rendering content. It's the fastest and most lightweight option.
  • Choose PlaywrightCrawler for modern websites with dynamic content, JavaScript rendering, or when you need to automate complex interactions like form submissions or logins.
  • Choose PuppeteerCrawler for Chrome-specific features or when working with existing Puppeteer code.
  • Choose JSDOMCrawler for sites with light JavaScript that don't require a full browser but need more than just HTML parsing.

Practical Code Examples

Basic Scraping with CheerioCrawler

Perfect for static websites without complex JavaScript rendering

import { CheerioCrawler } from 'crawlee';

// Initialize the CheerioCrawler
const crawler = new CheerioCrawler({
  // This function will be called for each URL
  requestHandler: async ({ request, $, enqueueLinks }) => {
    // Extract data from the page using Cheerio selectors
    const title = $('title').text();
    const h1Text = $('h1').text();
    const productPrices = [];
    
    // Extract all product prices
    $('.product-price').each((index, el) => {
      productPrices.push($(el).text().trim());
    });
    
    // Log the results
    console.log(`URL: ${request.url}`);
    console.log(`Title: ${title}`);
    console.log(`H1: ${h1Text}`);
    console.log('Product prices:', productPrices);
    
    // Optionally, enqueue links for crawling
    await enqueueLinks({
      selector: '.pagination a',
      baseUrl: request.loadedUrl,
    });
  },
  // Other options...
  maxRequestsPerCrawl: 10, // Limit the crawler to 10 requests
  maxConcurrency: 5,       // Maximum concurrent requests
});

// Start the crawler with a list of URLs
await crawler.run(['https://example.com/products']);

Key points:

  • CheerioCrawler uses Cheerio under the hood, which has a jQuery-like API
  • The requestHandler function is called for each page
  • Use CSS selectors to extract data from the page
  • The enqueueLinks function allows you to add more URLs to the crawl queue

Common Challenges and Solutions

Modern web scraping faces numerous challenges as websites implement increasingly sophisticated anti-bot measures. Here's how to overcome the most common obstacles using Crawlee and 4G mobile proxies.

Anti-Bot Protection

Challenge:

Websites implement sophisticated anti-bot measures to detect and block automated scraping.

Solution:

Use browser fingerprint randomization, realistic mouse movements and delays, and high-quality 4G mobile proxies that have real-user trust scores.
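
As a minimal sketch, PlaywrightCrawler can pair Crawlee's fingerprint generation with a proxy (this assumes playwright is installed alongside crawlee, and the proxy URL is a placeholder for your own 4G mobile proxy endpoint):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  // Placeholder credentials - substitute your mobile proxy endpoint
  proxyUrls: ['http://user:pass@mobile-proxy.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  browserPoolOptions: {
    // Ask Crawlee's browser pool to generate realistic fingerprints
    useFingerprints: true,
  },
  requestHandler: async ({ page, request }) => {
    console.log(`Scraping ${request.url} with a randomized fingerprint`);
  },
});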

Dynamic Content Loading

Challenge:

Content loads via JavaScript or lazy-loading techniques, making it invisible to basic scrapers.

Solution:

Use PlaywrightCrawler or PuppeteerCrawler with proper waiting strategies (waitForSelector, waitForFunction) to ensure content is fully loaded before extraction.
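
A sketch of both waiting strategies inside a PlaywrightCrawler request handler (the .product-card selectors and the item threshold are hypothetical):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    // Wait for a specific element to appear in the DOM
    await page.waitForSelector('.product-card', { timeout: 30000 });
    // Or wait for an arbitrary condition evaluated in the browser
    await page.waitForFunction(() => document.querySelectorAll('.product-card').length >= 10);
    const names = await page.$$eval('.product-card h2', (els) => els.map((el) => el.textContent.trim()));
    console.log(request.url, names);
  },
});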

CAPTCHA Barriers

Challenge:

CAPTCHAs and verification challenges block automated access to websites.

Solution:

Use 4G mobile proxies that have high trust scores, implement session persistence, and minimize request patterns that trigger CAPTCHA systems.
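
In Crawlee terms, that mostly means persistent sessions and a calm request rhythm. A sketch (sameDomainDelaySecs is available in recent Crawlee 3.x releases):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Reuse sessions so cookies build up the way a real visitor's would
  useSessionPool: true,
  persistCookiesPerSession: true,
  // Keep the request pattern unhurried to avoid tripping CAPTCHA heuristics
  maxConcurrency: 2,
  sameDomainDelaySecs: 5,
  requestHandler: async ({ page, request }) => {
    console.log(`Visited ${request.url}`);
  },
});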

IP Blocking and Rate Limiting

Challenge:

Sites block IPs that make too many requests or follow suspicious patterns.

Solution:

Implement proxy rotation with high-quality 4G mobile proxies, introduce random delays, and limit concurrent requests per domain.
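
A sketch of proxy rotation with random delays; the proxy URLs are placeholders for your own endpoints:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:pass@mobile-proxy-1.example.com:8000',
    'http://user:pass@mobile-proxy-2.example.com:8000',
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration, // Crawlee rotates through the list automatically
  maxConcurrency: 3,  // cap parallel requests to stay under rate limits
  requestHandler: async ({ request, $ }) => {
    // Random 1-3 second pause to break up a machine-regular rhythm
    await new Promise((resolve) => setTimeout(resolve, 1000 + Math.random() * 2000));
    console.log(`${request.url}: ${$('title').text()}`);
  },
});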

Changing Website Structure

Challenge:

Website markup changes frequently, breaking selectors and extraction patterns.

Solution:

Create robust selectors, implement resilient data extraction with multiple fallback patterns, and set up monitoring for scraper health.
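
One way to build fallbacks is a small helper that tries candidate selectors in order (all selectors here are hypothetical):

// Returns the first non-empty match from a list of candidate selectors
function extractFirst($, selectors) {
  for (const selector of selectors) {
    const text = $(selector).first().text().trim();
    if (text) return text;
  }
  return null; // nothing matched - worth logging so breakage is visible
}

// Inside a CheerioCrawler requestHandler:
// const price = extractFirst($, ['.price-current', '[data-testid="price"]', '.product-price']);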

Session Management

Challenge:

Websites use cookies and session data to track users and detect automation.

Solution:

Preserve and manage cookies between requests, maintain session state, and implement login flows where necessary.
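
Crawlee's session pool handles much of this. A sketch that keeps a small pool of long-lived identities, each with its own cookie jar:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  useSessionPool: true,
  persistCookiesPerSession: true, // cookies follow the session across requests
  sessionPoolOptions: {
    maxPoolSize: 20, // a handful of stable identities instead of thousands of one-offs
  },
  requestHandler: async ({ request, session, $ }) => {
    console.log(`Session ${session.id} fetched ${request.url}`);
  },
});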

The 4G Mobile Proxy Advantage

4G mobile proxies provide unique advantages for web scraping that are difficult to match with other proxy types:

Genuine Residential IPs

4G proxies use IPs from actual mobile carriers, making them indistinguishable from real mobile users: they ARE real mobile connections.

High Trust Scores

Mobile IPs typically have excellent trust scores with minimal abuse history, reducing the likelihood of being blocked.

Geographic Targeting

Access country-specific IPs to scrape localized content and overcome geo-restrictions legitimately.

Stable Connections

Maintain consistent connections for extended scraping sessions with minimal disconnections.

Advanced Scraping Techniques

Handling Login-Protected Content

Access authenticated areas with session management

Many valuable data sources require authentication. Here's how to handle login-protected content with Crawlee:

  1. Create a session-handling crawler with PlaywrightCrawler
  2. Perform login once at the beginning of your scraping session
  3. Store cookies and session data to maintain authentication
  4. Use a dedicated proxy to keep a consistent IP address throughout the session

Pro tip: With 4G mobile proxies, your session is less likely to be flagged as suspicious compared to datacenter IPs, as they appear as genuine residential users.
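
A sketch of that flow, pinned to a single pooled session so the cookies and IP stay consistent. The login URL, form selectors, dashboard URL pattern, and environment variable names are all placeholders:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  persistCookiesPerSession: true,
  sessionPoolOptions: { maxPoolSize: 1 }, // one session = one cookie jar
  maxConcurrency: 1, // process the login before the protected pages
  requestHandler: async ({ page, request }) => {
    if (request.label === 'LOGIN') {
      // Placeholder selectors - adjust to the target site's login form;
      // assumes SITE_USER and SITE_PASS are set in the environment
      await page.fill('#username', process.env.SITE_USER);
      await page.fill('#password', process.env.SITE_PASS);
      await page.click('button[type="submit"]');
      await page.waitForURL('**/dashboard');
    } else {
      // Subsequent pages reuse the authenticated session's cookies
      console.log(await page.title());
    }
  },
});

await crawler.run([
  { url: 'https://example.com/login', label: 'LOGIN' },
  { url: 'https://example.com/account/data' },
]);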

Extracting Data from Interactive Elements

Handle tabs, accordion menus, and other dynamic content

Modern websites often hide content behind interactive elements. Here's how to access it:

  • Use Playwright's interaction API to click on elements, hover, and interact
  • Implement proper waiting strategies using waitForSelector or waitForFunction
  • Create custom navigation functions for complex multi-step processes
  • Handle lazy-loading content by scrolling and monitoring DOM changes

Pro tip: When interacting with a website, add random delays between actions to mimic natural human behavior and reduce the chance of being detected as a bot.
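
A sketch combining clicks, randomized pauses, and scrolling in a PlaywrightCrawler handler (the accordion and lazy-item selectors are hypothetical):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page }) => {
    // Expand every accordion section before reading its content
    for (const toggle of await page.$$('.accordion-header')) {
      await toggle.click();
      // Short random pause between actions to mimic a human reader
      await page.waitForTimeout(500 + Math.random() * 1000);
    }
    // Scroll down to trigger lazy-loaded items, then wait for them
    await page.mouse.wheel(0, 10000);
    await page.waitForSelector('.lazy-item');
  },
});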

Building Resilient Scrapers

Design scrapers that handle errors and site changes gracefully

Websites change frequently, and a robust scraper should adapt to these changes:

  • Implement multiple selector strategies with fallbacks for critical data
  • Use regular expressions or fuzzy matching for more flexible data extraction
  • Log detailed information about failures to quickly identify issues
  • Set up monitoring to alert you when success rates drop below thresholds
  • Implement automatic retries with different proxies on failure

Pro tip: Use 4G mobile proxies from different carriers and regions to increase your scraping reliability. If one carrier is blocked, others may still work perfectly.
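
Crawlee's retry hooks cover the logging and retry points above. A minimal sketch:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestRetries: 3, // retries typically go out with a fresh session/proxy
  failedRequestHandler: async ({ request }) => {
    // Runs only after all retries are exhausted - log enough to diagnose
    console.error(`Giving up on ${request.url}:`, request.errorMessages);
  },
  requestHandler: async ({ request, $ }) => {
    const title = $('h1').first().text().trim();
    // Throwing marks the request as failed so it gets retried
    if (!title) throw new Error('h1 missing - the page layout may have changed');
    console.log(request.url, title);
  },
});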

Ready to Build Reliable Web Scrapers?

Combine the power of Crawlee with Coronium's 4G mobile proxies for unparalleled scraping reliability. Get access to clean residential IPs with high trust scores that make your scrapers virtually undetectable.

Questions? Contact us at hello@coronium.io or via our Telegram channel