How to Build a RAG Pipeline with FireScraper

April 10, 2026 · 5 min read

tutorial

rag

guide

Retrieval-augmented generation (RAG) starts with data. Before you can build a knowledge base, you need clean, structured text from the web. This guide walks you through the entire pipeline: crawling websites with FireScraper, exporting the results, and loading them into a vector database for RAG.

What You Will Build

By the end of this tutorial, you will have:

A FireScraper crawl that turns a documentation site into clean JSONL
Text chunks loaded into a vector database
A working RAG query that answers questions using the crawled content

Prerequisites

A FireScraper account (sign up free — 1,000 crawl units, no credit card)
Node.js 18+ installed
An OpenAI API key (for embeddings and completions)

Step 1: Install the SDK

npm install @firescraper/sdk

Step 2: Start a Crawl

import { FireScraper } from '@firescraper/sdk';

const client = new FireScraper({ apiKey: 'fsk_your_api_key' });

// Crawl a documentation site
const session = await client.scrape({
  name: 'docs-crawl',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  minTextLength: 100,
});

console.log(`Crawl started: ${session.id}`);

The maxDepth: 2 setting follows links up to two hops from the seed URL. The scraper: 'article' mode extracts the main content, stripping navigation, footers, and sidebars. The minTextLength: 100 setting filters out pages with fewer than 100 words.
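To make "two hops" concrete, here is a toy breadth-first traversal over an in-memory link graph. This is only an illustration of depth-limited crawling, not FireScraper's actual crawler; the graph data, the `pagesWithinDepth` helper, and the choice to count the seed as depth 0 are all assumptions made for the sketch.

```typescript
// Toy link graph: each page maps to the pages it links to (hypothetical data).
type LinkGraph = Record<string, string[]>;

// Breadth-first traversal that records each page's hop count from the seed
// and stops following links once maxDepth hops have been used.
function pagesWithinDepth(
  graph: LinkGraph,
  seed: string,
  maxDepth: number,
): Map<string, number> {
  const depthByUrl = new Map<string, number>([[seed, 0]]);
  const queue: string[] = [seed];

  while (queue.length > 0) {
    const url = queue.shift()!;
    const depth = depthByUrl.get(url)!;
    if (depth >= maxDepth) continue; // deepest allowed level: keep the page, skip its links

    for (const next of graph[url] ?? []) {
      if (!depthByUrl.has(next)) {
        depthByUrl.set(next, depth + 1);
        queue.push(next);
      }
    }
  }
  return depthByUrl;
}

const graph: LinkGraph = {
  '/': ['/guide', '/api'],
  '/guide': ['/guide/auth'],
  '/guide/auth': ['/guide/auth/oauth'],
  '/api': [],
};

const reached = pagesWithinDepth(graph, '/', 2);
console.log([...reached.keys()]);
// '/guide/auth/oauth' is three hops from the seed, so it is not reached with maxDepth = 2.
```

If a page you expect is missing from a crawl, counting its hops from the seed this way is a quick first check before raising maxDepth.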

Step 3: Wait for Completion

const result = await client.waitForCompletion(session.id, {
  pollInterval: 5000,
  onProgress: (status) => {
    const { success, total } = status.counts;
    console.log(`Progress: ${success}/${total} pages scraped`);
  },
});

console.log(`Crawl complete: ${result.counts.success} pages`);

The SDK polls every 5 seconds and calls your progress callback with live counts. A typical documentation site with 50-100 pages completes in under two minutes.
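The call above shows no timeout option, and this guide does not say whether the SDK has one. If you want a hard upper bound on waiting, a generic polling helper like the sketch below is one way to get it. The `pollUntil` function and the simulated status check are hypothetical and not part of the FireScraper SDK.

```typescript
interface PollOptions {
  intervalMs: number; // delay between checks
  timeoutMs: number; // give up once another wait would pass this budget
}

// Repeatedly calls `check` until it returns a non-null value.
// Throws if the next wait would exceed the timeout budget.
async function pollUntil<T>(
  check: () => Promise<T | null>,
  { intervalMs, timeoutMs }: PollOptions,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = await check();
    if (value !== null) return value;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Timed out after ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Simulated status check (hypothetical): reports completion on the third call.
let simulatedCalls = 0;
const simulatedCheck = async () =>
  ++simulatedCalls >= 3 ? { state: 'completed' } : null;

pollUntil(simulatedCheck, { intervalMs: 10, timeoutMs: 1_000 }).then((status) =>
  console.log(`Finished with state: ${status.state}`),
);
```

In a real pipeline, `check` would wrap whatever status call your client exposes and return null while the crawl is still running.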

Step 4: Download Results as JSONL

import { writeFileSync } from 'fs';

const download = await client.getResults(session.id, 'jsonl');

// Save to disk
writeFileSync('crawl-results.jsonl', Buffer.from(download.data));

console.log(`Downloaded: ${download.fileName} (${download.contentType})`);

The JSONL format gives you one JSON object per line, each containing the page URL, title, and extracted text. This format is ideal for streaming into embedding pipelines.
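Because every line is an independent JSON object, one malformed line does not have to abort the whole import. The sketch below parses JSONL defensively, skipping blank lines and recording the line numbers it could not use. The `url`, `title`, and `text` field names follow the description above; the `parseJsonl` helper and its return shape are assumptions for illustration.

```typescript
interface CrawlRecord {
  url: string;
  title: string;
  text: string;
}

interface ParseOutcome {
  records: CrawlRecord[];
  skippedLines: number[]; // 1-based line numbers that failed to parse or validate
}

// Parses JSONL content line by line, keeping only well-formed records.
function parseJsonl(content: string): ParseOutcome {
  const records: CrawlRecord[] = [];
  const skippedLines: number[] = [];

  content.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === '') return; // ignore blank lines, including a trailing newline
    try {
      const value = JSON.parse(line);
      if (
        typeof value?.url === 'string' &&
        typeof value?.title === 'string' &&
        typeof value?.text === 'string'
      ) {
        records.push({ url: value.url, title: value.title, text: value.text });
      } else {
        skippedLines.push(index + 1); // valid JSON, but missing an expected field
      }
    } catch {
      skippedLines.push(index + 1); // not valid JSON
    }
  });

  return { records, skippedLines };
}

const sample = [
  '{"url":"https://docs.example.com/a","title":"A","text":"Alpha"}',
  '',
  'not json',
  '{"url":"https://docs.example.com/b","title":"B"}',
].join('\n');

const outcome = parseJsonl(sample);
console.log(outcome.records.length, outcome.skippedLines);
```

Logging the skipped line numbers makes it easy to spot a truncated download before it silently shrinks your index.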

Step 5: Chunk the Text

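Before embedding, split long documents into smaller chunks so that each one fits the embedding model's input limit and retrieval returns focused passages. Here is a simple approach that uses fixed-size character windows with overlap: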

import { readFileSync } from 'fs';

interface CrawlResult {
  url: string;
  title: string;
  text: string;
}

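function chunkText(text: string, maxChars = 1500, overlap = 200): string[] {
  if (overlap >= maxChars) {
    throw new RangeError('overlap must be smaller than maxChars');
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // last chunk reached; stepping back here would loop forever
    start = end - overlap; // step back so consecutive chunks share `overlap` characters
  }
  return chunks;
}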

// Read and chunk
const lines = readFileSync('crawl-results.jsonl', 'utf-8')
  .split('\n')
  .filter((line) => line.trim() !== ''); // skip blank lines so JSON.parse never sees an empty string
const documents = lines.map((line) => JSON.parse(line) as CrawlResult);

const chunks = documents.flatMap((doc) =>
  chunkText(doc.text).map((chunk, i) => ({
    text: chunk,
    metadata: { url: doc.url, title: doc.title, chunkIndex: i },
  }))
);

console.log(`Created ${chunks.length} chunks from ${documents.length} pages`);
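Fixed-size character windows can cut a sentence in half. One alternative, sketched below, packs whole paragraphs into each chunk and only falls back to a hard split when a single paragraph is too long. This is a swapped-in technique rather than the approach used above: the `chunkByParagraph` helper is hypothetical, it assumes paragraphs are separated by blank lines, and it does not add overlap between chunks.

```typescript
// Greedily packs whole paragraphs into chunks of at most maxChars characters.
// A single paragraph longer than maxChars is hard-split as a fallback.
function chunkByParagraph(text: string, maxChars = 1500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const packed: string[] = [];
  let current = '';

  for (const paragraph of paragraphs) {
    if (paragraph.length > maxChars) {
      // Flush what we have, then hard-split the oversized paragraph.
      if (current) {
        packed.push(current);
        current = '';
      }
      for (let i = 0; i < paragraph.length; i += maxChars) {
        packed.push(paragraph.slice(i, i + maxChars));
      }
      continue;
    }

    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
    } else {
      packed.push(current); // current chunk is full; start a new one
      current = paragraph;
    }
  }

  if (current) packed.push(current);
  return packed;
}

const sampleText = ['First paragraph.', 'Second paragraph.', 'Third paragraph.'].join('\n\n');
console.log(chunkByParagraph(sampleText, 40));
```

Whether this beats fixed windows depends on how well the extracted text preserves paragraph breaks, so it is worth comparing retrieval quality on your own crawl.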

Step 6: Embed and Store

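This step uses OpenAI embeddings and keeps the vectors in a simple in-memory array (swap in Pinecone, Chroma, or Weaviate for production). Note that Promise.all issues one embedding request per chunk, all at once, so a large crawl can run into API rate limits: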

import OpenAI from 'openai';

const openai = new OpenAI();

// Generate embeddings
const embeddings = await Promise.all(
  chunks.map(async (chunk) => {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunk.text,
    });
    return {
      ...chunk,
      embedding: response.data[0].embedding,
    };
  })
);

console.log(`Embedded ${embeddings.length} chunks`);
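The Promise.all call above sends one request per chunk. For larger crawls, a gentler pattern is to group chunks into batches and send the batches one after another. The sketch below shows that pattern with an injected `embedBatch` function so it runs without any API key; `toBatches`, `embedInBatches`, and the default batch size of 100 are assumptions for illustration.

```typescript
// Splits an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embeds texts batch by batch, one request at a time, preserving input order.
// `embedBatch` is supplied by the caller and must return one vector per input.
async function embedInBatches(
  texts: string[],
  embedBatch: (inputs: string[]) => Promise<number[][]>,
  batchSize = 100,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const batchVectors = await embedBatch(batch);
    if (batchVectors.length !== batch.length) {
      throw new Error(`Expected ${batch.length} vectors, got ${batchVectors.length}`);
    }
    vectors.push(...batchVectors);
  }
  return vectors;
}

console.log(toBatches([1, 2, 3, 4, 5], 2)); // [[1, 2], [3, 4], [5]]
```

With the OpenAI SDK, `embedBatch` can be a thin wrapper around `openai.embeddings.create`, since its `input` parameter accepts an array of strings as well as a single string. Map `response.data` to the `embedding` field of each entry to get the vectors.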

Step 7: Query Your Knowledge Base

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function query(question: string, topK = 3) {
  // Embed the question
  const qEmbed = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const qVector = qEmbed.data[0].embedding;

  // Find most similar chunks
  const ranked = embeddings
    .map((e) => ({ ...e, score: cosineSimilarity(qVector, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Generate answer
  const context = ranked.map((r) => r.text).join('\n\n---\n\n');
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer based on this context:\n\n${context}` },
      { role: 'user', content: question },
    ],
  });

  return {
    answer: completion.choices[0].message.content,
    sources: ranked.map((r) => r.metadata.url),
  };
}

// Try it
// Named ragResult so it does not clash with the `result` declared in Step 3
const ragResult = await query('How do I configure authentication?');
console.log(ragResult.answer);
console.log('Sources:', ragResult.sources);
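Two refinements are worth considering once the basic query works: capping how much context goes into the prompt, and listing each source URL only once when several top chunks come from the same page. The sketch below does both. The `buildContext` helper, the numbered `[n]` labels, and the 6,000-character default budget are assumptions for illustration, not part of the code above.

```typescript
interface RankedChunk {
  text: string;
  score: number;
  metadata: { url: string; title: string; chunkIndex: number };
}

const SEPARATOR = '\n\n---\n\n';

// Builds a numbered context string from best-first ranked chunks.
// Stops at the first chunk that would push the context past maxChars.
function buildContext(
  ranked: RankedChunk[],
  maxChars = 6000,
): { context: string; sources: string[] } {
  const parts: string[] = [];
  const sources: string[] = [];
  let used = 0;

  for (const chunk of ranked) {
    const part = `[${parts.length + 1}] ${chunk.metadata.url}\n${chunk.text}`;
    const cost = part.length + (parts.length > 0 ? SEPARATOR.length : 0);
    if (used + cost > maxChars) break;
    parts.push(part);
    used += cost;
    if (!sources.includes(chunk.metadata.url)) sources.push(chunk.metadata.url); // de-duplicate
  }

  return { context: parts.join(SEPARATOR), sources };
}

const demoRanked: RankedChunk[] = [
  { text: 'Set the API key header.', score: 0.91, metadata: { url: 'https://docs.example.com/auth', title: 'Auth', chunkIndex: 0 } },
  { text: 'Tokens expire after one hour.', score: 0.88, metadata: { url: 'https://docs.example.com/auth', title: 'Auth', chunkIndex: 1 } },
  { text: 'Rate limits apply per key.', score: 0.75, metadata: { url: 'https://docs.example.com/limits', title: 'Limits', chunkIndex: 0 } },
];

const built = buildContext(demoRanked);
console.log(built.sources);
```

Numbering the chunks also lets you ask the model to cite the label it relied on, which makes answers easier to verify against the source pages.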

Step 8: Keep It Fresh with Scheduled Crawls

RAG pipelines are only as good as their data. Set up a scheduled crawl in the FireScraper dashboard to re-crawl your sources daily, weekly, or monthly. When the crawl completes, a webhook notifies your pipeline to re-embed the updated content.

const weeklySession = await client.scrape({
  name: 'docs-crawl-weekly',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  webhookUrl: 'https://your-api.com/webhook/crawl-complete',
});

Your webhook endpoint receives a POST when the crawl finishes, including the session ID. Use it to trigger a re-embedding job automatically.
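The handler logic on your side can stay small: parse the body, pull out the session ID, and queue the re-embedding job. The sketch below keeps that logic in a plain function so it can be wired into any HTTP framework. The payload shape is not documented in this guide, so the `sessionId` field name is an assumption, as are the `handleCrawlWebhook` helper and the status codes it returns.

```typescript
type WebhookResult = { status: number; message: string };

// Validates a webhook body and hands the session ID to a re-embedding job.
// ASSUMPTION: the payload is JSON with a string `sessionId` field.
function handleCrawlWebhook(
  rawBody: string,
  enqueueReembed: (sessionId: string) => void,
): WebhookResult {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400, message: 'Body is not valid JSON' };
  }

  const sessionId = (payload as { sessionId?: unknown } | null)?.sessionId;
  if (typeof sessionId !== 'string' || sessionId.length === 0) {
    return { status: 422, message: 'Missing sessionId' };
  }

  enqueueReembed(sessionId); // return quickly; do the heavy work in the background job
  return { status: 202, message: `Re-embedding queued for ${sessionId}` };
}

// Example with an in-memory queue standing in for a real job runner.
const queued: string[] = [];
console.log(handleCrawlWebhook('{"sessionId":"abc123"}', (id) => queued.push(id)));
```

Log one real webhook delivery first and adjust the field name to match what you actually receive.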

Next Steps

Structured extraction: Define a JSON schema to pull specific fields (title, summary, code examples) alongside full text
Markdown export: Use the Markdown format for LLM-optimized output that uses fewer tokens
Multiple sources: Crawl documentation sites, blog archives, and knowledge bases into a single unified index
Production vector DB: Replace the in-memory store with Pinecone, Chroma, Weaviate, or Qdrant

Start building your RAG pipeline

1,000 free crawl units. No credit card required. Clean text exports in JSONL, Markdown, and more.
