
How to Build a RAG Pipeline with FireScraper

5 min read
tutorial
rag
guide

Retrieval-augmented generation (RAG) starts with data. Before you can build a knowledge base, you need clean, structured text from the web. This guide walks you through the entire pipeline: crawling websites with FireScraper, exporting the results, and loading them into a vector database for RAG.

What You Will Build

By the end of this tutorial, you will have:

  1. A FireScraper crawl that turns a documentation site into clean JSONL
  2. Text chunks embedded and loaded into a vector store (in-memory here, swappable for a vector database)
  3. A working RAG query that answers questions using the crawled content

Prerequisites

  • A FireScraper account (sign up free — 1,000 crawl units, no credit card)
  • Node.js 18+ installed
  • An OpenAI API key (for embeddings and completions)

Step 1: Install the SDK

npm install @firescraper/sdk

Step 2: Start a Crawl

import { FireScraper } from '@firescraper/sdk';

const client = new FireScraper({ apiKey: 'fsk_your_api_key' });

// Crawl a documentation site
const session = await client.scrape({
  name: 'docs-crawl',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  minTextLength: 100,
});

console.log(`Crawl started: ${session.id}`);

The maxDepth: 2 setting follows links up to two hops from the seed URL. The scraper: 'article' mode extracts the main content, stripping navigation, footers, and sidebars. Pages whose extracted text falls below the minTextLength threshold of 100 are filtered out.

Step 3: Wait for Completion

const result = await client.waitForCompletion(session.id, {
  pollInterval: 5000,
  onProgress: (status) => {
    const { success, total } = status.counts;
    console.log(`Progress: ${success}/${total} pages scraped`);
  },
});

console.log(`Crawl complete: ${result.counts.success} pages`);

The SDK polls every 5 seconds and calls your progress callback with live counts. A typical documentation site with 50-100 pages completes in under two minutes.
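
The call above shows no explicit deadline, and this guide does not say whether the SDK offers a timeout option. If it does not, one generic way to avoid waiting forever on a stalled crawl is to race the promise against a timer. The sketch below is plain TypeScript; withTimeout is a hypothetical helper, not part of the FireScraper SDK.

```typescript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping the wait would then look like withTimeout(client.waitForCompletion(session.id), 10 * 60_000, 'crawl'), where the ten-minute limit is an arbitrary example value. Note that this only stops the local wait; it does not cancel the crawl itself.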

Step 4: Download Results as JSONL

import { writeFileSync } from 'fs';

const download = await client.getResults(session.id, 'jsonl');

// Save to disk
writeFileSync('crawl-results.jsonl', Buffer.from(download.data));

console.log(`Downloaded: ${download.fileName} (${download.contentType})`);

The JSONL format gives you one JSON object per line, each containing the page URL, title, and extracted text. This format is ideal for streaming into embedding pipelines.
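
For large crawls, it may be preferable to read the file one line at a time instead of loading it whole. The following is a minimal sketch using Node's built-in readline module. It assumes each record carries url, title, and text fields as described above; the exact field names in a real export are an assumption, so check a sample line first. CrawlRecord, parseJsonlLine, and readJsonl are illustrative names.

```typescript
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// ASSUMPTION: record shape inferred from the description above.
interface CrawlRecord {
  url: string;
  title: string;
  text: string;
}

// Parse one JSONL line. Returns null for blank or malformed lines so that
// a single bad record does not abort the whole import.
function parseJsonlLine(line: string): CrawlRecord | null {
  const trimmed = line.trim();
  if (trimmed === '') return null;
  try {
    const parsed = JSON.parse(trimmed);
    if (parsed === null || typeof parsed !== 'object') return null;
    if (typeof parsed.url !== 'string' || typeof parsed.text !== 'string') return null;
    return {
      url: parsed.url,
      title: typeof parsed.title === 'string' ? parsed.title : '',
      text: parsed.text,
    };
  } catch {
    return null;
  }
}

// Yield records from disk one at a time, skipping lines that fail to parse.
async function* readJsonl(path: string): AsyncGenerator<CrawlRecord> {
  const rl = createInterface({
    input: createReadStream(path, 'utf-8'),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    const record = parseJsonlLine(line);
    if (record) yield record;
  }
}
```

A consumer can then iterate with for await (const record of readJsonl('crawl-results.jsonl')) and hand each record to the chunking step without holding the full export in memory.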

Step 5: Chunk the Text

Before embedding, split long documents into smaller chunks. Here is a simple approach:

import { readFileSync } from 'fs';

interface CrawlResult {
  url: string;
  title: string;
  text: string;
}

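function chunkText(text: string, maxChars = 1500, overlap = 200): string[] {
  if (overlap >= maxChars) throw new Error('overlap must be smaller than maxChars');
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    // Stop once the end of the text is reached; stepping back by the
    // overlap at this point would repeat the last chunk forever.
    if (end === text.length) break;
    start = end - overlap;
  }
  return chunks;
}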

// Read and chunk
const lines = readFileSync('crawl-results.jsonl', 'utf-8').trim().split('\n');
const documents = lines.map((line) => JSON.parse(line) as CrawlResult);

const chunks = documents.flatMap((doc) =>
  chunkText(doc.text).map((chunk, i) => ({
    text: chunk,
    metadata: { url: doc.url, title: doc.title, chunkIndex: i },
  }))
);

console.log(`Created ${chunks.length} chunks from ${documents.length} pages`);
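
Fixed-width slicing can cut a sentence or code sample in half. One alternative, sketched here under the assumption that the extracted text separates paragraphs with blank lines, is to pack whole paragraphs into each chunk and fall back to hard splitting only for paragraphs longer than the limit. chunkByParagraph is a hypothetical helper and, unlike chunkText, it does not add overlap.

```typescript
// Pack whole paragraphs into chunks of at most `maxChars` characters.
function chunkByParagraph(text: string, maxChars = 1500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/) // blank lines mark paragraph boundaries
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    if (para.length > maxChars) {
      // Oversized paragraph: flush what we have, then hard-split it.
      if (current) {
        chunks.push(current);
        current = '';
      }
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      continue;
    }
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length > maxChars) {
      // Adding this paragraph would overflow, so start a new chunk with it.
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Whether this beats fixed-width chunks depends on the content; it tends to help most on documentation with short, self-contained paragraphs.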

Step 6: Embed and Store

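This step uses OpenAI embeddings and keeps the vectors in a simple in-memory array (swap in Pinecone, Chroma, or Weaviate for production). Install the OpenAI SDK with npm install openai; the client reads your key from the OPENAI_API_KEY environment variable.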

import OpenAI from 'openai';

const openai = new OpenAI();

// Generate embeddings
const embeddings = await Promise.all(
  chunks.map(async (chunk) => {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunk.text,
    });
    return {
      ...chunk,
      embedding: response.data[0].embedding,
    };
  })
);

console.log(`Embedded ${embeddings.length} chunks`);
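
Sending one request per chunk, all at the same moment, may run into rate limits on larger crawls. The embeddings endpoint also accepts an array of inputs, so one option is to send chunks in batches, one batch at a time. In the sketch below, toBatches, embedInBatches, and the batch size of 100 are illustrative choices and not part of either SDK. The embedding call is passed in as a function so the batching logic stays independent of the client.

```typescript
// Embeds several inputs in one call and returns one vector per input,
// in the same order. With the OpenAI SDK this could be, for example:
//   async (inputs) => {
//     const res = await openai.embeddings.create({
//       model: 'text-embedding-3-small',
//       input: inputs,
//     });
//     return [...res.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
//   }
type EmbedBatch = (inputs: string[]) => Promise<number[][]>;

// Split an array into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed chunks batch by batch, sequentially, attaching each vector to its chunk.
async function embedInBatches<T extends { text: string }>(
  chunks: T[],
  embedBatch: EmbedBatch,
  batchSize = 100,
): Promise<Array<T & { embedding: number[] }>> {
  const out: Array<T & { embedding: number[] }> = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embedBatch(batch.map((c) => c.text));
    if (vectors.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${vectors.length}`);
    }
    batch.forEach((chunk, i) => out.push({ ...chunk, embedding: vectors[i] }));
  }
  return out;
}
```

Running batches sequentially trades some speed for predictability; a small concurrency limit is a reasonable middle ground if throughput matters.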

Step 7: Query Your Knowledge Base

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function query(question: string, topK = 3) {
  // Embed the question
  const qEmbed = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const qVector = qEmbed.data[0].embedding;

  // Find most similar chunks
  const ranked = embeddings
    .map((e) => ({ ...e, score: cosineSimilarity(qVector, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Generate answer
  const context = ranked.map((r) => r.text).join('\n\n---\n\n');
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer based on this context:\n\n${context}` },
      { role: 'user', content: question },
    ],
  });

  return {
    answer: completion.choices[0].message.content,
    sources: ranked.map((r) => r.metadata.url),
  };
}

// Try it
// Destructure into new names: `result` is already declared in Step 3
const { answer, sources } = await query('How do I configure authentication?');
console.log(answer);
console.log('Sources:', sources);

Step 8: Keep It Fresh with Scheduled Crawls

RAG pipelines are only as good as their data. Set up a scheduled crawl in the FireScraper dashboard to re-crawl your sources daily, weekly, or monthly. When the crawl completes, a webhook notifies your pipeline to re-embed the updated content. To register the endpoint from code, pass a webhookUrl when you start a crawl:

// Named weeklySession because `session` is already declared in Step 2
const weeklySession = await client.scrape({
  name: 'docs-crawl-weekly',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  webhookUrl: 'https://your-api.com/webhook/crawl-complete',
});

Your webhook endpoint receives a POST when the crawl finishes, including the session ID. Use it to trigger a re-embedding job automatically.
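
A minimal sketch of such an endpoint, using Node's built-in http module. The payload shape is an assumption: this guide only says the POST includes the session ID, so the sessionId field name is hypothetical, as are extractSessionId, createWebhookServer, and the re-embedding callback. Check the actual payload before relying on it.

```typescript
import { createServer, Server } from 'node:http';

// Pull the session ID out of a raw request body.
// ASSUMPTION: the payload is JSON with a string `sessionId` field.
function extractSessionId(rawBody: string): string | null {
  try {
    const payload = JSON.parse(rawBody);
    const id = payload?.sessionId;
    return typeof id === 'string' && id.length > 0 ? id : null;
  } catch {
    return null;
  }
}

// Build (but do not start) a server that calls `onCrawlComplete`
// whenever a valid webhook POST arrives.
function createWebhookServer(
  onCrawlComplete: (sessionId: string) => Promise<void>,
): Server {
  return createServer((req, res) => {
    if (req.method !== 'POST' || req.url !== '/webhook/crawl-complete') {
      res.writeHead(404).end();
      return;
    }
    let body = '';
    req.setEncoding('utf-8');
    req.on('data', (chunk) => {
      body += chunk;
    });
    req.on('end', () => {
      const sessionId = extractSessionId(body);
      if (!sessionId) {
        res.writeHead(400).end('missing session ID');
        return;
      }
      // Acknowledge right away, then run the slow job in the background.
      res.writeHead(202).end();
      onCrawlComplete(sessionId).catch((err) => {
        console.error('re-embed job failed', err);
      });
    });
  });
}

// Usage (hypothetical): download, chunk, and embed inside the callback.
// createWebhookServer(async (id) => { /* re-run Steps 4 to 6 for `id` */ }).listen(3000);
```

Responding before the job finishes keeps the request short, which matters if the sender retries on slow responses. A production endpoint would also want to verify that the request really came from FireScraper, using whatever mechanism the service documents.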

Next Steps

  • Structured extraction: Define a JSON schema to pull specific fields (title, summary, code examples) alongside full text
  • Markdown export: Use the Markdown format for LLM-optimized output that uses fewer tokens
  • Multiple sources: Crawl documentation sites, blog archives, and knowledge bases into a single unified index
  • Production vector DB: Replace the in-memory store with Pinecone, Chroma, Weaviate, or Qdrant

Start building your RAG pipeline

1,000 free crawl units. No credit card required. Clean text exports in JSONL, Markdown, and more.