ScrapeGraphAI: Revolutionizing Web Scraping with LLMs in 2024
Official Streamlit demo: https://scrapegraph-ai-demo.streamlit.app/
Web scraping has long been a critical tool for data collection, powering market research, competitive analysis, and various other data-driven applications. Traditional methods of web scraping often struggle with sophisticated anti-bot measures, dynamic content, and the sheer scale of data available online.
In this article you will learn about ScrapeGraphAI, a revolutionary web scraping library that leverages large language models (LLMs) and direct graph logic to create intelligent scraping pipelines. This article will review ScrapeGraphAI, explore its capabilities, and demonstrate how it can be used to fast-track web scraping with cutting-edge technology.
Introduction to ScrapeGraphAI
ScrapeGraphAI is a Python library designed to simplify and enhance the web scraping process by utilizing LLMs and graph logic. This innovative approach allows users to specify the information they want to extract, and the library handles the rest, creating efficient and effective scraping pipelines.
Key Features
- LLM Integration: Uses large language models to understand and process web content.
- Direct Graph Logic: Constructs logical pipelines for scraping tasks.
- Multiple Models: Supports various LLMs, including OpenAI, Groq, Azure, and local models with Ollama.
- Versatile Pipelines: Provides single-page and multi-page scraping capabilities, as well as audio file generation.
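To make the idea of graph logic concrete, here is a toy sketch of a directed pipeline in which each node is a function that takes a state dictionary and passes an updated copy to the next node. It only illustrates the concept and is not ScrapeGraphAI's implementation: the node names, the hard-coded HTML, and the `run_pipeline` helper are all assumptions made for this example, and the "answer" node merely stands in for an LLM call.

```python
import re
from typing import Callable, Dict, Optional

State = Dict[str, str]
Node = Callable[[State], State]


def fetch_node(state: State) -> State:
    """Stand-in for a page download; the HTML is hard-coded here."""
    return {**state, "html": "<h1>Example Project</h1><p>A demo page.</p>"}


def parse_node(state: State) -> State:
    """Strip tags and collapse whitespace (a real parser would do far more)."""
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def answer_node(state: State) -> State:
    """Stand-in for the LLM call: combine the prompt with the parsed text."""
    return {**state, "answer": f"{state['prompt']} -> {state['text']}"}


def run_pipeline(nodes: Dict[str, Node], edges: Dict[str, str],
                 entry: str, state: State) -> State:
    """Follow directed edges from the entry node until a node has no successor."""
    current: Optional[str] = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)
    return state


nodes = {"fetch": fetch_node, "parse": parse_node, "answer": answer_node}
edges = {"fetch": "parse", "parse": "answer"}  # directed: fetch -> parse -> answer

final_state = run_pipeline(nodes, edges, "fetch", {"prompt": "Summarize the page"})
print(final_state["answer"])
```

Keeping the nodes independent of each other is what lets a pipeline be rearranged or extended: adding a step means adding one function and one edge.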
Installation and Setup
pip install scrapegraphai
playwright install
It’s recommended to install the library in a virtual environment to avoid conflicts with other libraries.
ScrapeGraphAI offers three main scraping pipelines:
- SmartScraperGraph: Single-page scraper requiring a user prompt and an input source.
- SearchGraph: Multi-page scraper that extracts information from the top n search results of a search engine.
- SpeechGraph: Single-page scraper that generates an audio file from the extracted information.
Why ScrapeGraphAI?
Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapeGraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
This flexibility helps scrapers remain functional even when website layouts change. ScrapeGraphAI supports many LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that run on your machine through Ollama.
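The brittleness of fixed patterns is easy to reproduce. The sketch below uses two hypothetical HTML snippets (invented for this example) that carry the same book title before and after a redesign, and a regular expression written for the old markup only.

```python
import re
from typing import Optional

# Hypothetical markup: the same data before and after a site redesign.
OLD_LAYOUT = '<div class="title">Clean Code</div>'
NEW_LAYOUT = '<h2 class="book-name">Clean Code</h2>'

# A fixed pattern tied to the old markup.
TITLE_PATTERN = re.compile(r'<div class="title">(.*?)</div>')


def extract_title(html: str) -> Optional[str]:
    """Return the title if the hard-coded pattern matches, else None."""
    match = TITLE_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_title(OLD_LAYOUT))  # the layout the pattern was written for
print(extract_title(NEW_LAYOUT))  # the pattern no longer matches
```

The second call returns `None` even though the title is still on the page, which is the kind of failure that normally sends a developer back to update selectors by hand.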
Use Cases and Examples
Case 1: SmartScraperGraph Using Local Models
The SmartScraperGraph is ideal for scraping specific information from a single page.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me the latest news headlines and summaries",
    source="https://news.ycombinator.com/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
This example configures a local model, served by Ollama, to extract the latest news headlines and summaries from the Hacker News front page.
Output:
{
    "latest_news": [
        {
            "headline": "OpenAI departures: Why can\u2019t former employees talk?",
            "summary": "An article on Vox discussing the reasons behind the silence of former OpenAI employees."
        },
        {
            "headline": "Bend: a high-level language that runs on GPUs (via HVM2)",
            "summary": "A GitHub project introducing Bend, a high-level language designed to run on GPUs."
        },
        {
            "headline": "Zoraxy: Open-Source, All in one homelab network routing solution",
            "summary": "An open-source project called Zoraxy that provides a comprehensive network routing solution for homelabs."
        },
        {
            "headline": "Wuffs: Wrangling Untrusted File Formats Safely",
            "summary": "A GitHub project by Google that focuses on safely handling untrusted file formats."
        },
        {
            "headline": "Non-Euclidean Doom: what happens to a game when pi is not 3.14159 (2022)",
            "summary": "A video from CCC exploring the effects on the game Doom when the value of pi is altered."
        },
        {
            "headline": "Toon3D: Seeing cartoons from a new perspective",
            "summary": "A project by Toon3D Studio that offers a new perspective on viewing cartoons."
        }
    ]
}
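Once the scraper returns, the result can be handled like any other Python data. The sketch below assumes the result is a dictionary shaped like the output above. Since the keys are chosen by the model and may differ between runs, the hypothetical `headlines` helper reads them defensively. The sample data is a subset of the output shown in this article.

```python
import json

# Sample shaped like the output above; a real run may use different keys.
result = {
    "latest_news": [
        {
            "headline": "Bend: a high-level language that runs on GPUs (via HVM2)",
            "summary": "A GitHub project introducing Bend, a high-level language designed to run on GPUs.",
        },
        {
            "headline": "Wuffs: Wrangling Untrusted File Formats Safely",
            "summary": "A GitHub project by Google that focuses on safely handling untrusted file formats.",
        },
    ]
}


def headlines(data: dict, key: str = "latest_news") -> list:
    """Collect headline strings, skipping entries that lack one."""
    items = data.get(key, [])
    return [
        item["headline"]
        for item in items
        if isinstance(item, dict) and "headline" in item
    ]


for title in headlines(result):
    print("-", title)

# Serialize for storage; ensure_ascii=False keeps non-ASCII characters readable.
serialized = json.dumps(result, indent=2, ensure_ascii=False)
```

Using `data.get(key, [])` instead of indexing directly means a run that names the list differently yields an empty result rather than a `KeyError`.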
Case 2: SearchGraph Using Mixed Models
from scrapegraphai.graphs import SearchGraph

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
    "max_results": 5,
}

search_graph = SearchGraph(
    prompt="List me the top-rated programming books with their descriptions",
    config=graph_config
)

result = search_graph.run()
print(result)
The SearchGraph pipeline is perfect for extracting information from multiple pages based on search engine results.
You can specify the search engine through a parameter in the SearchGraph configuration. ScrapeGraphAI typically falls back to a default search engine, but you can set one explicitly if needed.
Output:
{
"books": [
{"title": "Clean Code", "description": "A Handbook of Agile Software Craftsmanship."},
{"title": "The Pragmatic Programmer", "description": "Your Journey to Mastery."},
{"title": "Design Patterns", "description": "Elements of Reusable Object-Oriented Software."},
{"title": "Code Complete", "description": "A Practical Handbook of Software Construction."},
{"title": "You Don't Know JS", "description": "A series of books diving deep into JavaScript."}
]
}
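The string "GROQ_API_KEY" in the Case 2 configuration reads as a placeholder for a real key. One common way to avoid hard-coding the key is to read it from an environment variable, sketched below. The `build_search_config` helper and the choice of variable name are assumptions for this example and are not part of ScrapeGraphAI; the configuration values themselves are copied from Case 2.

```python
import os


def build_search_config(env_var: str = "GROQ_API_KEY") -> dict:
    """Build the Case 2 configuration, reading the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": api_key,
            "temperature": 0,
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",
        },
        "max_results": 5,
    }
```

The returned dictionary can then be passed as `config` to `SearchGraph` in place of the literal `graph_config`, which keeps the key out of source control.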
Case 3: SpeechGraph Using OpenAI
The SpeechGraph pipeline can be used to convert scraped data into an audio file.
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder: same key, used for text-to-speech
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
Output:
The output will be an audio file summarizing the projects on the page.
ScrapeGraphAI: The Future of Web Scraping
ScrapeGraphAI is designed to address many of the challenges that traditional scrapers face by using LLMs to create intelligent and adaptive scraping pipelines. Its integration with various LLMs allows it to understand and process web content more effectively than traditional scraping tools.
ScrapeGraphAI Roadmap:
Short-Term
- Integration with more LLM APIs.
- Test proxy rotation implementation.
- Add more search engines inside the SearchInternetNode.
- Improve documentation and create tutorials.
Mid-Term
- Node for handling API requests.
- Improve SearchGraph to analyze the first 5 search engine results.
- Create a DOM tree of websites and study tree forks from the root node.
- Develop a scraping folder containing reports, DOM trees, and metrics.
Long-Term
- Automatic generation of scraping pipelines from prompts.
- Create an API for the library.
- Finetune an LLM for HTML content.
Conclusion
ScrapeGraphAI represents a new era of web scraping, combining the power of LLMs with intelligent graph logic to automate and optimize the extraction of web data. By addressing the challenges of traditional scraping methods, and with features such as proxy rotation and scraping reports on its roadmap, ScrapeGraphAI is poised to become a vital tool for anyone looking to harness the full potential of web data.
Whether you’re a researcher, data scientist, or business analyst, ScrapeGraphAI offers a powerful, flexible, and scalable solution to meet your web scraping needs. Embrace this new age of web scraping and unlock the insights hidden within the vast expanse of online data.