Retrieval-augmented generation (RAG) starts with data. Before you can build a knowledge base, you need clean, structured text from the web. This guide walks you through the entire pipeline: crawling websites with FireScraper, exporting the results, and loading them into a vector database for RAG.
What You Will Build
By the end of this tutorial, you will have:
- A FireScraper crawl that turns a documentation site into clean JSONL
- Text chunks loaded into a vector database
- A working RAG query that answers questions using the crawled content
Prerequisites
- A FireScraper account (sign up free — 1,000 crawl units, no credit card)
- Node.js 18+ installed
- An OpenAI API key (for embeddings and completions)
Step 1: Install the SDK
npm install @firescraper/sdk
Step 2: Start a Crawl
import { FireScraper } from '@firescraper/sdk';
const client = new FireScraper({ apiKey: 'fsk_your_api_key' });
// Crawl a documentation site
const session = await client.scrape({
  name: 'docs-crawl',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  minTextLength: 100,
});
console.log(`Crawl started: ${session.id}`);
The maxDepth: 2 setting follows links up to two hops from the seed URL. The scraper: 'article' mode extracts the main content, stripping navigation, footers, and sidebars. Pages with fewer than 100 words are filtered out.
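To make the hop count concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over a small in-memory link graph. The crawlToDepth function and the sample graph are hypothetical illustrations, not FireScraper internals, and the service's actual traversal order may differ.

```typescript
// Hypothetical link graph: each URL maps to the URLs it links to.
type LinkGraph = Record<string, string[]>;

// Breadth-first traversal that stops following links beyond `maxDepth` hops from the seed.
function crawlToDepth(graph: LinkGraph, seed: string, maxDepth: number): Map<string, number> {
  const depthByUrl = new Map<string, number>([[seed, 0]]);
  const queue: string[] = [seed];
  while (queue.length > 0) {
    const url = queue.shift() as string;
    const depth = depthByUrl.get(url) ?? 0;
    if (depth >= maxDepth) continue; // pages at the depth limit are kept, but their links are not followed
    for (const next of graph[url] ?? []) {
      if (!depthByUrl.has(next)) {
        depthByUrl.set(next, depth + 1);
        queue.push(next);
      }
    }
  }
  return depthByUrl;
}

const sampleGraph: LinkGraph = {
  '/': ['/guide', '/api'],
  '/guide': ['/guide/auth'],
  '/guide/auth': ['/guide/auth/oauth'],
  '/api': [],
};

const visited = crawlToDepth(sampleGraph, '/', 2);
// '/guide/auth/oauth' is three hops from the seed, so it is not visited
console.log(Array.from(visited.keys()));
```

With a depth of 2, the seed page, the pages it links to, and the pages those link to are included; anything deeper is left out.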
Step 3: Wait for Completion
const result = await client.waitForCompletion(session.id, {
  pollInterval: 5000,
  onProgress: (status) => {
    const { success, total } = status.counts;
    console.log(`Progress: ${success}/${total} pages scraped`);
  },
});
console.log(`Crawl complete: ${result.counts.success} pages`);
The SDK polls every 5 seconds and calls your progress callback with live counts. A typical documentation site with 50-100 pages completes in under two minutes.
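For readers curious about what a polling helper of this kind does, the sketch below shows the general pattern with a simulated crawl. The pollUntilDone function, the CrawlStatus shape, and its done flag are assumptions made for illustration; this is not the SDK's implementation.

```typescript
// Hypothetical status shape for illustration; the real SDK's status object may differ.
interface CrawlStatus {
  done: boolean;
  counts: { success: number; total: number };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `fetchStatus` repeatedly, waiting `pollInterval` ms between checks, until the crawl is done.
async function pollUntilDone(
  fetchStatus: () => Promise<CrawlStatus>,
  options: { pollInterval: number; onProgress?: (status: CrawlStatus) => void },
): Promise<CrawlStatus> {
  while (true) {
    const status = await fetchStatus();
    options.onProgress?.(status);
    if (status.done) return status;
    await sleep(options.pollInterval);
  }
}

// Simulated crawl that reports completion on the third status check.
let checks = 0;
const fakeFetch = async (): Promise<CrawlStatus> => {
  checks += 1;
  return { done: checks >= 3, counts: { success: checks * 10, total: 30 } };
};

const progressLog: string[] = [];
const finalStatus = pollUntilDone(fakeFetch, {
  pollInterval: 10,
  onProgress: (s) => progressLog.push(`${s.counts.success}/${s.counts.total}`),
});
finalStatus.then((s) => console.log(`Done: ${s.counts.success} pages`, progressLog));
```

A longer interval means fewer status requests at the cost of noticing completion later; at 5 seconds, a two-minute crawl needs roughly two dozen checks.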
Step 4: Download Results as JSONL
import { writeFileSync } from 'fs';
const download = await client.getResults(session.id, 'jsonl');
// Save to disk
writeFileSync('crawl-results.jsonl', Buffer.from(download.data));
console.log(`Downloaded: ${download.fileName} (${download.contentType})`);
The JSONL format gives you one JSON object per line, each containing the page URL, title, and extracted text. This format is ideal for streaming into embedding pipelines.
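Because each line is an independent JSON object, a reader can process the file line by line and skip bad records without losing the rest. The following is a minimal sketch, assuming each record carries url, title, and text string fields; the parseJsonl helper is hypothetical and not part of the SDK.

```typescript
interface CrawlRecord {
  url: string;
  title: string;
  text: string;
}

// Parses JSONL content, skipping blank lines and counting lines that are not valid records.
function parseJsonl(content: string): { records: CrawlRecord[]; skipped: number } {
  const records: CrawlRecord[] = [];
  let skipped = 0;
  for (const line of content.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const value = JSON.parse(line) as Partial<CrawlRecord>;
      if (
        typeof value.url === 'string' &&
        typeof value.title === 'string' &&
        typeof value.text === 'string'
      ) {
        records.push({ url: value.url, title: value.title, text: value.text });
      } else {
        skipped += 1; // valid JSON, but a required field is missing
      }
    } catch {
      skipped += 1; // malformed JSON or a non-object value on this line
    }
  }
  return { records, skipped };
}

const sample = [
  '{"url":"https://docs.example.com/a","title":"A","text":"Alpha"}',
  '',
  'not json',
  '{"url":"https://docs.example.com/b","title":"B"}',
].join('\n');

const parsed = parseJsonl(sample);
console.log(parsed.records.length, parsed.skipped); // 1 2
```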
Step 5: Chunk the Text
Before embedding, split long documents into smaller chunks. Here is a simple approach:
import { readFileSync } from 'fs';
interface CrawlResult {
  url: string;
  title: string;
  text: string;
}
function chunkText(text: string, maxChars = 1500, overlap = 200): string[] {
  if (overlap >= maxChars) throw new Error('overlap must be smaller than maxChars');
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    // Stop once the end of the text is reached; stepping back by `overlap` here would loop forever
    if (end === text.length) break;
    start = end - overlap;
  }
  return chunks;
}
// Read and chunk, skipping blank lines
const lines = readFileSync('crawl-results.jsonl', 'utf-8')
  .split('\n')
  .filter((line) => line.trim() !== '');
const documents = lines.map((line) => JSON.parse(line) as CrawlResult);
const chunks = documents.flatMap((doc) =>
  chunkText(doc.text).map((chunk, i) => ({
    text: chunk,
    metadata: { url: doc.url, title: doc.title, chunkIndex: i },
  }))
);
console.log(`Created ${chunks.length} chunks from ${documents.length} pages`);
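To see how the chunk size and overlap interact, it can help to list the character offsets a sliding window produces. The sketch below uses a hypothetical chunkWindows helper with the same defaults as above (1,500 characters per chunk, 200 characters of overlap), so each chunk starts 1,300 characters after the previous one.

```typescript
// Lists the [start, end) character offsets of each chunk for a text of `length` characters.
function chunkWindows(length: number, maxChars = 1500, overlap = 200): Array<[number, number]> {
  if (overlap >= maxChars) throw new Error('overlap must be smaller than maxChars');
  const windows: Array<[number, number]> = [];
  const stride = maxChars - overlap; // distance between consecutive chunk starts
  for (let start = 0; start < length; start += stride) {
    const end = Math.min(start + maxChars, length);
    windows.push([start, end]);
    if (end === length) break; // this window already reaches the end of the text
  }
  return windows;
}

// A 4,000-character page yields three windows: 0-1500, 1300-2800, 2600-4000
console.log(chunkWindows(4000));
```

The overlap means text near a chunk boundary appears in both neighboring chunks, which reduces the chance that a sentence split at the boundary is lost to retrieval, at the cost of embedding some text twice.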
Step 6: Embed and Store
Using OpenAI embeddings and a simple in-memory store (swap for Pinecone, Chroma, or Weaviate in production):
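import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Generate embeddings in batches: sending one request per chunk all at once can hit rate limits
const BATCH_SIZE = 100;
const embeddings: Array<(typeof chunks)[number] & { embedding: number[] }> = [];
for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
  const batch = chunks.slice(i, i + BATCH_SIZE);
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: batch.map((chunk) => chunk.text),
  });
  // Each returned item carries the index of its input within the batch
  for (const item of response.data) {
    embeddings.push({ ...batch[item.index], embedding: item.embedding });
  }
}
console.log(`Embedded ${embeddings.length} chunks`);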
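If you expect to swap the in-memory store for a hosted database later, it may help to put a small interface in front of it from the start. The sketch below is one way to do that; the VectorStore interface, the InMemoryVectorStore class, and the toy three-dimensional vectors are hypothetical stand-ins for real embeddings and for a production client.

```typescript
interface StoredChunk {
  id: string;
  embedding: number[];
  text: string;
}

// Hypothetical minimal interface that a production store adapter could also implement.
interface VectorStore {
  upsert(items: StoredChunk[]): void;
  search(queryEmbedding: number[], topK: number): Array<StoredChunk & { score: number }>;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as dissimilar to everything
}

class InMemoryVectorStore implements VectorStore {
  private items = new Map<string, StoredChunk>();

  upsert(items: StoredChunk[]): void {
    for (const item of items) this.items.set(item.id, item); // the same id overwrites the old entry
  }

  search(queryEmbedding: number[], topK: number): Array<StoredChunk & { score: number }> {
    return Array.from(this.items.values())
      .map((item) => ({ ...item, score: cosine(queryEmbedding, item.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

const store = new InMemoryVectorStore();
store.upsert([
  { id: 'a#0', embedding: [1, 0, 0], text: 'auth setup' },
  { id: 'b#0', embedding: [0, 1, 0], text: 'rate limits' },
  { id: 'c#0', embedding: [0.9, 0.1, 0], text: 'api keys' },
]);
const hits = store.search([1, 0, 0], 2);
console.log(hits.map((h) => h.id)); // ['a#0', 'c#0']
```

Keying entries by an id such as URL plus chunk index means a later re-crawl can overwrite stale chunks instead of appending duplicates.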
Step 7: Query Your Knowledge Base
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
async function query(question: string, topK = 3) {
  // Embed the question
  const qEmbed = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  const qVector = qEmbed.data[0].embedding;
  // Find most similar chunks
  const ranked = embeddings
    .map((e) => ({ ...e, score: cosineSimilarity(qVector, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  // Generate answer
  const context = ranked.map((r) => r.text).join('\n\n---\n\n');
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: `Answer based on this context:\n\n${context}` },
      { role: 'user', content: question },
    ],
  });
  return {
    answer: completion.choices[0].message.content,
    sources: ranked.map((r) => r.metadata.url),
  };
}
// Try it
const ragResult = await query('How do I configure authentication?');
console.log(ragResult.answer);
console.log('Sources:', ragResult.sources);
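Several top-ranked chunks can come from the same page, so the list of sources may repeat a URL, and the joined context can grow large as topK increases. The sketch below shows one way to cap the context size and deduplicate sources; the buildContext helper and its maxContextChars parameter are hypothetical additions, not part of the pipeline above.

```typescript
interface RankedChunk {
  text: string;
  score: number;
  metadata: { url: string; title: string; chunkIndex: number };
}

// Joins ranked chunks into a prompt context, stopping before the character budget is exceeded.
function buildContext(
  ranked: RankedChunk[],
  maxContextChars = 6000,
): { context: string; sources: string[] } {
  const separator = '\n\n---\n\n';
  const parts: string[] = [];
  const sources = new Set<string>();
  let used = 0;
  for (const chunk of ranked) {
    const cost = chunk.text.length + (parts.length > 0 ? separator.length : 0);
    if (used + cost > maxContextChars) break;
    parts.push(chunk.text);
    sources.add(chunk.metadata.url); // a Set keeps each URL once, in first-seen order
    used += cost;
  }
  return { context: parts.join(separator), sources: Array.from(sources) };
}

const rankedSample: RankedChunk[] = [
  { text: 'Use API keys.', score: 0.91, metadata: { url: 'https://docs.example.com/auth', title: 'Auth', chunkIndex: 0 } },
  { text: 'Rotate keys often.', score: 0.88, metadata: { url: 'https://docs.example.com/auth', title: 'Auth', chunkIndex: 1 } },
  { text: 'Limits apply.', score: 0.52, metadata: { url: 'https://docs.example.com/limits', title: 'Limits', chunkIndex: 0 } },
];

const built = buildContext(rankedSample);
console.log(built.sources); // two unique URLs, although three chunks were used
```

Note that if the highest-ranked chunk alone exceeds the budget, this sketch returns an empty context, so the budget should be comfortably larger than the chunk size.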
Step 8: Keep It Fresh with Scheduled Crawls
RAG pipelines are only as good as their data. Set up a scheduled crawl in the FireScraper dashboard to re-crawl your sources daily, weekly, or monthly. When the crawl completes, a webhook notifies your pipeline to re-embed the updated content.
const weeklySession = await client.scrape({
  name: 'docs-crawl-weekly',
  urls: ['https://docs.example.com'],
  maxDepth: 2,
  scraper: 'article',
  webhookUrl: 'https://your-api.com/webhook/crawl-complete',
});
Your webhook endpoint receives a POST when the crawl finishes, including the session ID. Use it to trigger a re-embedding job automatically.
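The exact payload format is not described here beyond carrying the session ID, so the sketch below assumes a JSON body with a sessionId field. That field name and the handleCrawlWebhook helper are hypothetical; check them against the payload your endpoint actually receives before relying on them.

```typescript
// Assumed payload shape: { "sessionId": "..." }. Verify the real field name before relying on it.
type WebhookResult =
  | { ok: true; sessionId: string }
  | { ok: false; status: number; error: string };

// Validates a raw POST body and extracts the session ID to hand to a re-embedding job.
function handleCrawlWebhook(rawBody: string): WebhookResult {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { ok: false, status: 400, error: 'Body is not valid JSON' };
  }
  if (typeof payload !== 'object' || payload === null) {
    return { ok: false, status: 400, error: 'Body must be a JSON object' };
  }
  const sessionId = (payload as Record<string, unknown>)['sessionId'];
  if (typeof sessionId !== 'string' || sessionId === '') {
    return { ok: false, status: 422, error: 'Missing sessionId' };
  }
  return { ok: true, sessionId };
}

console.log(handleCrawlWebhook('{"sessionId":"sess_123"}'));
console.log(handleCrawlWebhook('not json'));
```

Keeping the validation in a pure function makes it easy to test without a running server. In the HTTP handler itself, it is usually safer to acknowledge the request quickly and run the download, chunking, and embedding steps in a background job.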
Next Steps
- Structured extraction: Define a JSON schema to pull specific fields (title, summary, code examples) alongside full text
- Markdown export: Use the Markdown format for LLM-optimized output that uses fewer tokens
- Multiple sources: Crawl documentation sites, blog archives, and knowledge bases into a single unified index
- Production vector DB: Replace the in-memory store with Pinecone, Chroma, Weaviate, or Qdrant