FireScraper Python SDK: Sync, Async, and LangChain Integration

The FireScraper Python SDK is live on PyPI. Install it, pass your API key, and start scraping websites from any Python environment — scripts, notebooks, async pipelines, or LangChain RAG workflows.

pip install firescraper

Why a Python SDK?

The REST API works with any language, and the TypeScript SDK has been available since launch. But most AI and ML teams work in Python. If you are building a RAG pipeline with LangChain, training a model with PyTorch, or running data processing in Jupyter notebooks, you should not have to write HTTP requests by hand.

The Python SDK gives you typed methods, automatic polling, progress callbacks, and proper error handling — everything you need to integrate FireScraper into a Python workflow.

Quick Start

from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")

# Start a crawl
session = client.scrape(
    name="Documentation crawl",
    urls=["https://docs.example.com/"],
    max_depth=2,
    scraper="article",
)
print(f"Session started: {session.id}")

# Wait for completion with progress updates
result = client.wait_for_completion(
    session.id,
    on_progress=lambda s: print(
        f"  {s.counts.success}/{s.counts.total} pages scraped"
    ),
)
print(f"Done! {result.counts.success} pages scraped")

# Download results as JSON
download = client.get_results(session.id, format="json")
with open("results.json", "wb") as f:
    f.write(download.data)

That is the entire workflow: start a crawl, wait for it to finish, download the results. The SDK handles authentication, polling, and error mapping.
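The lambda passed to on_progress can be swapped for a named function when you want more than a raw count. A minimal sketch, assuming only the two fields used above (counts.success and counts.total); the helper names format_progress and print_progress are made up for this example and are not part of the SDK:

```python
def format_progress(session) -> str:
    """Build a one-line progress message from a session snapshot.

    Relies only on session.counts.success and session.counts.total,
    the two fields shown in the quick start.
    """
    done = session.counts.success
    total = session.counts.total
    # total can be 0 before any page is discovered, so avoid dividing by zero.
    percent = (100 * done / total) if total else 0.0
    return f"{done}/{total} pages scraped ({percent:.0f}%)"


def print_progress(session) -> None:
    """Callback with the same shape as the on_progress lambda."""
    print(f"  {format_progress(session)}")
```

It would be passed the same way as the lambda: client.wait_for_completion(session.id, on_progress=print_progress).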

Async Support

For async pipelines, use AsyncFireScraper. It has the same API but every method is a coroutine:

import asyncio
from firescraper import AsyncFireScraper

async def main():
    async with AsyncFireScraper("fsk_your_api_key") as client:
        session = await client.scrape(
            name="Async crawl",
            urls=["https://docs.example.com/"],
            max_depth=2,
        )

        result = await client.wait_for_completion(session.id)
        print(f"Scraped {result.counts.success} pages")

        download = await client.get_results(session.id, format="json")
        print(f"Downloaded {len(download.data)} bytes")

asyncio.run(main())

Use the async client when you are running inside an async framework like FastAPI, or when you want to run multiple crawls concurrently.
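Running several crawls at once could look roughly like the sketch below, which fans out with asyncio.gather. The helper names crawl_one and crawl_many and the shape of the jobs mapping are assumptions for illustration; only scrape and wait_for_completion come from the SDK examples above. Whether your plan limits the number of simultaneous sessions is not covered here.

```python
import asyncio


async def crawl_one(client, name, urls, max_depth=2):
    """Start one crawl and wait for it; returns the finished session."""
    session = await client.scrape(name=name, urls=urls, max_depth=max_depth)
    return await client.wait_for_completion(session.id)


async def crawl_many(client, jobs):
    """Run several crawls concurrently on one async client.

    jobs maps a crawl name to its list of start URLs (a hypothetical
    shape chosen for this sketch). Returns {name: finished session}.
    """
    names = list(jobs)
    # gather preserves argument order, so results line up with names.
    results = await asyncio.gather(
        *(crawl_one(client, name, jobs[name]) for name in names)
    )
    return dict(zip(names, results))
```

Inside an async with AsyncFireScraper(...) as client block, you would call await crawl_many(client, {"Docs": ["https://docs.example.com/"]}). By default gather propagates the first exception to the caller; pass return_exceptions=True if you would rather collect failures per crawl.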

LangChain Integration

If you are building a RAG pipeline with LangChain, the FireScraperLoader turns any website into LangChain Document objects in one call:

pip install firescraper langchain-firescraper langchain-core

from langchain_firescraper import FireScraperLoader

loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
    scraper="article",
)

# Load all pages as LangChain Documents
docs = loader.load()
print(f"Loaded {len(docs)} documents")

for doc in docs[:3]:
    print(f"  {doc.metadata['url']} — {doc.metadata['word_count']} words")

Each Document has:

  • page_content: the extracted text
  • metadata: url, title, word_count, session_id, scraper, source
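The metadata fields make it easy to filter before indexing. As a rough sketch, the function below drops very short pages (navigation stubs, empty landing pages) using word_count. The name keep_substantial and the 50-word threshold are arbitrary choices for this example, not SDK behavior:

```python
def keep_substantial(docs, min_words=50):
    """Return documents whose metadata word_count is at least min_words.

    Documents without a word_count are kept, since their length is unknown.
    """
    kept = []
    for doc in docs:
        count = doc.metadata.get("word_count")
        if count is None or count >= min_words:
            kept.append(doc)
    return kept
```

With the loader above, docs = keep_substantial(loader.load()) would run the filter before any chunking or embedding.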

Plugging into a RAG Pipeline

Here is how you would use FireScraperLoader with a vector store and a retrieval chain:

from langchain_firescraper import FireScraperLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Scrape the docs
loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=3,
)
docs = loader.load()

# 2. Chunk the text
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Query
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(),
)
answer = qa.invoke("How do I authenticate API requests?")
print(answer["result"])

That is a complete RAG pipeline, from raw website to answered questions, in roughly 25 lines of Python.

Lazy Loading for Large Crawls

For crawls with thousands of pages, use lazy_load() to process documents one at a time without loading everything into memory:

# Reuses loader, splitter, and vectorstore from the previous example
for doc in loader.lazy_load():
    # Split and index each document as soon as it is yielded
    chunks = splitter.split_documents([doc])
    vectorstore.add_documents(chunks)
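Indexing one document per call can mean many small embedding requests. A possible middle ground is to group the lazy stream into small batches, as sketched below. The helpers batched and index_in_batches and the batch size of 50 are assumptions of this example; split_documents and add_documents are the same LangChain calls used above.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_batches(documents, splitter, vectorstore, size=50):
    """Split and index documents batch by batch; returns chunks added."""
    total = 0
    for batch in batched(documents, size):
        chunks = splitter.split_documents(batch)
        vectorstore.add_documents(chunks)
        total += len(chunks)
    return total
```

A call such as index_in_batches(loader.lazy_load(), splitter, vectorstore) keeps at most one batch of documents in memory at a time.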

All SDK Methods

| Method | Description |
|---|---|
| client.scrape(name, urls, ...) | Start a new crawl session |
| client.get_session(session_id) | Get status, page counts, queue depth |
| client.wait_for_completion(session_id) | Poll until the crawl finishes |
| client.list_results(session_id) | List available export files |
| client.get_results(session_id, format) | Download results (json, csv, markdown, zip, ...) |
| client.get_partial_results(session_id) | Download mid-crawl results |
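Since get_results takes a format argument, saving one crawl in several formats is a short loop. The sketch below assumes download.data is bytes, as in the quick start; the function name export_all, the results.* file names, and the .md extension for markdown are choices made for this example only.

```python
from pathlib import Path

# File extension per export format; ".md" for markdown is an assumption.
EXTENSIONS = {"json": "json", "csv": "csv", "markdown": "md"}


def export_all(client, session_id, out_dir, formats=("json", "csv", "markdown")):
    """Download each format and write it to out_dir; returns the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fmt in formats:
        download = client.get_results(session_id, format=fmt)
        path = out_dir / f"results.{EXTENSIONS.get(fmt, fmt)}"
        path.write_bytes(download.data)  # bytes, as written in the quick start
        written.append(path)
    return written
```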

Error Handling

The SDK maps HTTP errors to typed exceptions:

from firescraper.exceptions import (
    AuthenticationError,  # 401 — bad or missing API key
    BadRequestError,      # 400 — invalid parameters
    NotFoundError,        # 404 — session not found
    RateLimitError,       # 429 — too many requests
    ServerError,          # 5xx — server-side issue
    TimeoutError,         # request or poll timeout
)

try:
    session = client.scrape(name="Test", urls=["https://example.com"])
except AuthenticationError:
    print("Check your API key")
except RateLimitError:
    print("Slow down — retry after a moment")
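For transient failures such as RateLimitError or ServerError, one common pattern is to retry with exponential backoff. The helper below is a generic sketch rather than something the SDK provides: call_with_retry and its parameters are hypothetical, and it does not read any retry hint from the exception, since it is not stated here whether one is exposed.

```python
import time


def call_with_retry(fn, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception listed in retry_on, wait and try again.

    Delays double each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Usage would look like call_with_retry(lambda: client.scrape(name="Test", urls=["https://example.com"]), retry_on=(RateLimitError, ServerError)). Errors such as AuthenticationError and BadRequestError are left out on purpose, because retrying them will not change the outcome.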

Install and Get Started

pip install firescraper

Create an API key from the dashboard, and you are ready to go. The SDK requires Python 3.9+ and has a single dependency (httpx).

Python SDK is live

pip install firescraper. Sync, async, and LangChain integration included. 1,000 free credits to start.