
ScapeGraph AI: Intelligent Web Scraping with LLMs


Introduction


The web scraping landscape has been revolutionized by the integration of artificial intelligence. Traditional web scraping methods often rely on rigid selectors and patterns that break when websites change their structure. This creates a maintenance burden for developers who must constantly update their scrapers to adapt to these changes.

ScapeGraph AI represents a paradigm shift in this field by leveraging the power of Large Language Models (LLMs) to create intelligent, flexible web scraping pipelines. Rather than depending on fixed patterns, ScapeGraph AI understands the content it's extracting, making it resilient to changes in website layouts and structures.

What is ScapeGraph AI?

ScapeGraph AI is an open-source Python library designed to transform web scraping through the application of advanced AI techniques. Unlike traditional scraping tools that break when websites change their HTML structure, ScapeGraph AI uses the natural language understanding capabilities of Large Language Models to extract data based on meaning rather than fixed patterns.

The core concept is simple yet powerful: instead of writing complex selectors, you simply describe what information you want to extract in natural language. ScapeGraph AI then handles the technical details of finding and extracting that information, regardless of how it's structured in the source.

Key Benefits of ScapeGraph AI

  • Adaptability: Automatically adjusts to changes in website structures and layouts
  • Simplicity: Requires minimal configuration and coding compared to traditional scrapers
  • Intelligence: Understands the context and meaning of the content being scraped
  • Flexibility: Works with various document types including HTML, XML, JSON, and Markdown
  • Extensibility: Supports multiple LLM providers and configurations

At its core, ScapeGraph AI implements a graph-based architecture where each node in the graph represents a specific operation in the scraping pipeline. This modular approach allows for incredible flexibility in how data is extracted, processed, and structured.

Key Features

AI-Powered Scraping

Uses LLMs to understand and extract data based on meaning rather than fixed patterns, making scrapers more resilient to website changes.

Graph-Based Architecture

Modular pipeline design allows complex scraping tasks to be broken down into manageable, interconnected operations.

Multiple LLM Support

Compatible with OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Mistral AI, and local models via Ollama.

Multi-Format Support

Extracts data from HTML, XML, JSON, and Markdown files using the same intuitive interface.

Specialized Scrapers

Includes SmartScraperGraph, SearchGraph, and SpeechGraph for different extraction scenarios.

Developer-Friendly

Simple Python API with minimal configuration required to get started with advanced scraping capabilities.

These features combine to create a web scraping solution that's not only powerful but also adaptable to changing websites and requirements. The intelligent nature of ScapeGraph AI reduces the maintenance burden traditionally associated with web scrapers, allowing developers to focus on extracting value from the data rather than constantly updating their code.

How ScapeGraph AI Works

ScapeGraph AI's approach to web scraping is fundamentally different from traditional methods. Instead of relying on CSS selectors or XPath expressions that break when websites change, it uses Large Language Models to understand the content and extract the relevant information regardless of the specific HTML structure.

The Core Pipeline

  1. Content Acquisition: The library fetches content from a website or loads it from a local file.
  2. Content Processing: The content is preprocessed, chunked if necessary to handle token limits, and prepared for the LLM.
  3. LLM Analysis: The LLM interprets the content and the user's extraction prompt to understand what data needs to be extracted.
  4. Intelligent Extraction: The system extracts the specified information based on the LLM's understanding, not on fixed patterns.
  5. Result Formatting: The extracted data is structured according to the user's requirements and returned as the final output.

Graph-Based Architecture

The "Graph" in the name refers to the library's graph-based pipeline architecture. Each stage in the scraping process is represented as a node in a directed graph, with edges representing the flow of data between stages. This approach offers several advantages:

  • Modularity: Each node performs a specific function, allowing for easy customization and extension.
  • Flexibility: Pipelines can be reconfigured by changing the connections between nodes.
  • Reusability: Common operations can be encapsulated in reusable nodes.
  • Transparency: The graph structure makes it clear how data flows through the system.

LLM Integration

ScapeGraph AI supports multiple LLM providers, allowing users to choose the model that best fits their needs and budget. The library handles the complexities of token limits, prompt engineering, and response parsing, making it easy to leverage the power of LLMs without getting bogged down in technical details.

Installation and Setup

Getting started with ScapeGraph AI is straightforward. The library requires Python and can be installed using pip.

Prerequisites

  • Python 3.9 or higher
  • pip package manager

Installation Steps

# Install ScapeGraph AI
pip install scrapegraphai

# Install Playwright (for browser-based scraping)
playwright install

Configuration

After installation, you'll need to configure ScapeGraph AI to use your preferred LLM provider. The library supports various providers, each with its own configuration requirements.

OpenAI Configuration

# OpenAI configuration
import os
from scrapegraphai.graphs import SmartScraperGraph

# Set your API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Configure the graph
graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    },
    "verbose": True
}

Ollama (Local Model) Configuration

# Ollama configuration (for local models)
graph_config = {
    "llm": {
        "provider": "ollama",
        "model": "llama3.2",
        "model_tokens": 8192
    },
    "verbose": True
}

Other supported providers include Azure OpenAI, Google Gemini, Anthropic Claude, and Mistral AI, each with similar configuration patterns. The documentation provides detailed examples for each provider.
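
As an illustration, a Claude configuration following the same pattern might look like the sketch below. The provider string and model identifier are assumptions for illustration; the exact keys supported can vary between library versions, so check the documentation for your release.

# Anthropic Claude configuration (illustrative sketch; exact keys and
# model identifiers may differ between library versions)
import os

os.environ["ANTHROPIC_API_KEY"] = "your-api-key-here"

graph_config = {
    "llm": {
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20240620",
        "model_tokens": 8192
    },
    "verbose": True
}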

Basic Usage Examples

ScapeGraph AI provides an intuitive API that makes it easy to extract data from websites. The following examples demonstrate common usage patterns.

Basic Web Scraping

This example shows how to extract information from a website using the SmartScraperGraph:

from scrapegraphai.graphs import SmartScraperGraph

# Configure the graph
graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    },
    "verbose": True,
    "headless": False  # Set to True to run browser in headless mode
}

# Create the scraper
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract the main headline, author name, publication date, and key points from this news article",
    source="https://example-news-site.com/article/123",
    config=graph_config
)

# Run the scraper and get the results
result = smart_scraper_graph.run()
print(result)

Extracting Data from Local Files

ScapeGraph AI can also extract information from local files:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    }
}

# Create the scraper with a local file as the source
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all product names, prices, and specifications from this page",
    source="path/to/local/file.html",  # Local file path
    config=graph_config
)

# Run the scraper
result = smart_scraper_graph.run()
print(result)

Structured Data Extraction

You can request structured data in specific formats:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    }
}

# Request JSON-formatted data
smart_scraper_graph = SmartScraperGraph(
    prompt="""
    Extract the following information from this e-commerce page:
    1. Product name
    2. Price
    3. Available colors
    4. Sizes
    5. Customer rating
    
    Return the data as a JSON object with these fields.
    """,
    source="https://example-shop.com/products/item-123",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)  # Structured output containing the requested fields
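
Depending on the library version and how the LLM responds, run() may hand back either an already-parsed Python dictionary or a JSON string. A small defensive step (a sketch, not part of the library itself) keeps downstream code working in both cases:

import json

# Normalize the result: run() may return a dict or a JSON-formatted string
product_data = json.loads(result) if isinstance(result, str) else result

# Field names follow whatever structure the LLM produced for the prompt
for field, value in product_data.items():
    print(f"{field}: {value}")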

Advanced Techniques

ScapeGraph AI offers advanced features for more complex scraping scenarios. These techniques allow you to handle larger websites, customize the scraping process, and integrate with other tools.

Search-Based Extraction

The SearchGraph allows you to search for specific information within a larger document:

from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    }
}

# Create a SearchGraph for finding specific information
search_graph = SearchGraph(
    query="What is the return policy for electronics?",
    source="https://example-shop.com/terms-and-conditions",
    config=graph_config
)

result = search_graph.run()
print(result)  # This will return the relevant information about return policies

Speech Processing

SpeechGraph can process audio content:

from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    }
}

# Process audio content
speech_graph = SpeechGraph(
    prompt="Summarize the key points discussed in this podcast",
    source="path/to/audio/file.mp3",
    config=graph_config
)

result = speech_graph.run()
print(result)  # Summary of the audio content

Handling Pagination

For websites with paginated content, you can iteratively scrape multiple pages:

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    },
    "headless": True
}

# Function to scrape paginated content
def scrape_paginated_site():
    base_url = "https://example-site.com/products?page="
    all_results = []
    
    # Scrape multiple pages
    for page_num in range(1, 6):  # Pages 1-5
        url = base_url + str(page_num)
        
        scraper = SmartScraperGraph(
            prompt="Extract all product names, prices, and availability from this page",
            source=url,
            config=graph_config
        )
        
        result = scraper.run()
        all_results.append(result)
    
    return all_results

# Run the paginated scraper
paginated_results = scrape_paginated_site()
print(paginated_results)
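
In practice it is worth pacing requests and merging the per-page output into a single collection. The variation below is a sketch along those lines: it reuses the SmartScraperGraph and graph_config from the example above, the one-second delay is arbitrary, and it assumes each page's result is a list of product records (adjust to whatever shape your prompt actually produces).

import time

def scrape_paginated_site_politely(pages=5, delay_seconds=1.0):
    base_url = "https://example-site.com/products?page="
    all_products = []

    for page_num in range(1, pages + 1):
        scraper = SmartScraperGraph(
            prompt="Extract all product names, prices, and availability from this page",
            source=base_url + str(page_num),
            config=graph_config
        )
        page_result = scraper.run()

        # Assumes each page yields a list of product records; append the raw
        # result instead if your prompt returns a single object per page
        if isinstance(page_result, list):
            all_products.extend(page_result)
        else:
            all_products.append(page_result)

        time.sleep(delay_seconds)  # brief pause between pages to reduce load

    return all_products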

Custom Graph Pipelines

Advanced users can create custom graph pipelines for specialized scraping needs:

from scrapegraphai.nodes import FetchNode, ParseNode, LLMNode, OutputNode
from scrapegraphai.graphs import BaseGraph

# Create a custom graph pipeline
class CustomScraperGraph(BaseGraph):
    def __init__(self, prompt, source, config=None):
        super().__init__(config)
        
        # Define nodes
        self.fetch_node = FetchNode()
        self.parse_node = ParseNode()
        self.llm_node = LLMNode(prompt=prompt)
        self.output_node = OutputNode()
        
        # Connect nodes
        self.connect(self.fetch_node, self.parse_node)
        self.connect(self.parse_node, self.llm_node)
        self.connect(self.llm_node, self.output_node)
        
        # Set source
        self.fetch_node.source = source

# Use the custom graph
custom_graph = CustomScraperGraph(
    prompt="Extract specific information X, Y, and Z",
    source="https://example.com",
    config=graph_config
)

result = custom_graph.run()
print(result)

Using ScapeGraph AI with Proxies

When scraping websites at scale, you may encounter rate limits or IP blocks. Using proxies can help overcome these limitations by distributing your requests across multiple IP addresses. ScapeGraph AI can be configured to work with proxies, including mobile proxies from Coronium.io.

Why Use Proxies with ScapeGraph AI?

  • Avoid Rate Limiting: Distribute requests across multiple IPs to avoid triggering rate limits.
  • Bypass IP Blocks: Continue scraping even if some IPs are blocked by the target website.
  • Geo-Targeting: Access location-specific content by using proxies from different regions.
  • Enhanced Privacy: Hide your real IP address for more anonymous scraping.

Configuring Proxies in ScapeGraph AI

ScapeGraph AI supports proxy configuration through its underlying HTTP libraries. Here's how to set up proxies:

from scrapegraphai.graphs import SmartScraperGraph

# Proxy configuration
proxy_config = {
    "proxy": {
        "http": "http://username:password@proxy.coronium.io:8000",
        "https": "http://username:password@proxy.coronium.io:8000"
    }
}

# Combine with LLM configuration
graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    },
    "headless": True,
    "proxy": proxy_config["proxy"]  # Add proxy configuration
}

# Create scraper with proxy configuration
scraper = SmartScraperGraph(
    prompt="Extract product information from this e-commerce site",
    source="https://example-shop.com/products",
    config=graph_config
)

result = scraper.run()
print(result)
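
One quick way to confirm that requests are actually leaving through the proxy is to point the same configuration at an IP-echo page. This is just a sanity-check sketch; the httpbin.org endpoint is used purely as an illustration:

# Sanity check: scrape an IP-echo page through the same proxy configuration
checker = SmartScraperGraph(
    prompt="Return the IP address shown on this page",
    source="https://httpbin.org/ip",
    config=graph_config
)

print(checker.run())  # should report the proxy's IP, not your own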

Using Mobile Proxies with ScapeGraph AI

Mobile proxies provide IP addresses from mobile carriers, which are less likely to be blocked by websites. Coronium.io offers reliable mobile proxies that work seamlessly with ScapeGraph AI:

from scrapegraphai.graphs import SmartScraperGraph
import random

# List of mobile proxies
mobile_proxies = [
    "http://username:password@us-east.proxy.coronium.io:9000",
    "http://username:password@uk.proxy.coronium.io:9000",
    "http://username:password@germany.proxy.coronium.io:9000",
]

# Randomly select a proxy for each request
selected_proxy = random.choice(mobile_proxies)

# Configure graph with mobile proxy
graph_config = {
    "llm": {
        "provider": "openai",
        "model": "gpt-4",
        "model_tokens": 8192
    },
    "headless": True,
    "proxy": {
        "http": selected_proxy,
        "https": selected_proxy
    }
}

# Create and run the scraper
scraper = SmartScraperGraph(
    prompt="Extract pricing information for all products on this page",
    source="https://example.com/products",
    config=graph_config
)

result = scraper.run()
print(result)
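
To spread a batch of pages across several mobile proxies, the same pattern extends naturally: pick a different proxy for each page. The sketch below reuses the proxy list and configuration style shown above; the page count and URLs are placeholders.

# Rotate through the mobile proxy list, one proxy per page
results = []
for page_num, proxy in zip(range(1, 4), mobile_proxies):
    page_config = {
        "llm": {
            "provider": "openai",
            "model": "gpt-4",
            "model_tokens": 8192
        },
        "headless": True,
        "proxy": {"http": proxy, "https": proxy}
    }
    scraper = SmartScraperGraph(
        prompt="Extract pricing information for all products on this page",
        source=f"https://example.com/products?page={page_num}",
        config=page_config
    )
    results.append(scraper.run())

print(results)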

Why Choose Coronium Mobile Proxies

Coronium.io's mobile proxies offer several advantages for AI-powered web scraping:

  • High Success Rates: Mobile IPs are less likely to be blocked by websites, improving scraping reliability.
  • Geographic Diversity: Access to IPs from multiple countries for geo-targeted scraping.
  • Stable Connections: Reliable infrastructure ensures consistent performance for your ScapeGraph AI applications.
  • Flexible Authentication: Simple username/password authentication that integrates easily with ScapeGraph AI.
  • Rotating IPs: Options for both static and rotating IPs to suit different scraping strategies.

Use Cases for ScapeGraph AI

ScapeGraph AI's intelligent approach to web scraping opens up a wide range of applications across different industries. Here are some of the most common use cases:

E-commerce Monitoring

Track competitor prices, product availability, and promotions across multiple online stores to inform pricing and inventory decisions.

Market Research

Gather product reviews, ratings, and market trends from various sources to understand consumer sentiment and market dynamics.

Content Aggregation

Collect and organize content from multiple websites to create news aggregators, research databases, or specialized information portals.

Data Enrichment

Enhance existing datasets with additional information scraped from the web, providing more context and value for analysis.
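
For instance, a minimal enrichment loop might read a list of company websites from a CSV file and attach a short scraped description to each record. This is a sketch: the file name, column names, and the reuse of a graph_config like the ones shown earlier are illustrative assumptions.

# Data enrichment sketch: add a scraped description to an existing dataset
import csv
from scrapegraphai.graphs import SmartScraperGraph

# Illustrative input: a CSV with "company" and "website" columns
with open("companies.csv", newline="", encoding="utf-8") as f:
    companies = list(csv.DictReader(f))

for record in companies:
    scraper = SmartScraperGraph(
        prompt="Extract a one-sentence description of what this company does",
        source=record["website"],
        config=graph_config  # an LLM configuration like the ones shown earlier
    )
    record["description"] = scraper.run()

# Write the enriched dataset back out
with open("companies_enriched.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["company", "website", "description"])
    writer.writeheader()
    writer.writerows(companies)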

Real-World Examples

Financial Analysis

A financial research firm uses ScapeGraph AI to extract earnings reports, analyst estimates, and financial news from various sources. The AI-powered approach allows them to quickly gather and process information even when websites change their layouts.

# Financial data extraction example
from scrapegraphai.graphs import SmartScraperGraph

scraper = SmartScraperGraph(
    prompt="""
    Extract the following financial metrics for Q3 2024:
    1. Revenue
    2. Net Income
    3. EPS (Earnings Per Share)
    4. Revenue Growth (YoY)
    5. Profit Margin
    
    Also extract any forward guidance for Q4 2024.
    """,
    source="https://example-financial-site.com/earnings/company-xyz",
    config=graph_config
)

financial_data = scraper.run()
print(financial_data)

Real Estate Market Analysis

A real estate analytics company uses ScapeGraph AI to track property listings, prices, and features across multiple listing websites. This data helps identify market trends and provide insights to buyers and sellers.

# Real estate data extraction
from scrapegraphai.graphs import SmartScraperGraph

scraper = SmartScraperGraph(
    prompt="""
    Extract all property listings with the following details:
    1. Property type (house, apartment, etc.)
    2. Price
    3. Location (address and neighborhood)
    4. Square footage
    5. Number of bedrooms and bathrooms
    6. Year built
    7. Special features or amenities
    
    Format the data as a JSON array of objects.
    """,
    source="https://example-realestate.com/listings?city=newyork",
    config=graph_config
)

property_listings = scraper.run()
print(property_listings)

Research Paper Analysis

Academic researchers use ScapeGraph AI to extract key information from research papers, including methodologies, findings, and citations. This helps them stay current with the latest developments in their field.

# Research paper analysis
from scrapegraphai.graphs import SearchGraph

search_graph = SearchGraph(
    query="""
    What are the main findings related to transformer architecture improvements
    in NLP research papers published in the last two years?
    Summarize the key innovations and performance improvements.
    """,
    source="https://example-academic-repository.com/papers/computer-science/nlp",
    config=graph_config
)

research_findings = search_graph.run()
print(research_findings)

Conclusion

ScapeGraph AI represents a significant advancement in web scraping technology. By leveraging the power of Large Language Models and a flexible graph-based architecture, it offers a more intelligent, adaptable approach to data extraction that can handle changes in website structures without requiring constant maintenance.

The key advantages of ScapeGraph AI include:

  • Reduced Maintenance: Less need to update scraping code when websites change their layout.
  • Simplified Development: Natural language prompts instead of complex selectors and patterns.
  • Flexibility: Support for multiple document types and LLM providers.
  • Scalability: Ability to handle complex scraping tasks through custom graph pipelines.

When combined with reliable mobile proxies from Coronium.io, ScapeGraph AI becomes an even more powerful tool for web data extraction, offering improved success rates, geo-targeting capabilities, and enhanced privacy.

Enhance Your AI-Powered Web Scraping Today

Whether you're a data scientist, market researcher, or developer, combining ScapeGraph AI with Coronium's mobile proxies gives you a powerful toolkit for reliable and efficient web data extraction. Get started today to experience the benefits of intelligent, adaptive web scraping.


About the Author

Coronium.io Organization

Coronium.io is a leading provider of advanced networking solutions, specializing in proxy services and VPN technologies. Committed to innovation and user satisfaction, Coronium.io offers tools that enhance online privacy, security, and performance for individuals and businesses alike.

Disclaimer

Our 4G mobile proxies are intended for legal and legitimate use only. This page is solely for informational and marketing purposes. It is the user's responsibility to ensure compliance with the terms of service of the platforms they are using our proxies on. We do not condone or support the use of our proxies for illegal or unauthorized activities. By using our proxies, you agree to use them in accordance with all applicable laws and regulations. We will not be held liable for any misuse of our proxies. Please read our Terms of Service before using our services.