Markdown vs JSONL vs CSV: Choosing the Right Export Format for Your LLM Pipeline

April 17, 2026 · 6 min read

You have crawled a website and now you are staring at five export options: Markdown, JSONL, CSV, JSON, and ZIP. Which one should you pick?

The answer depends on what you are building. Each format is optimized for a different step in the AI pipeline. Choosing wrong means extra processing work — or worse, wasted tokens.

Figure: FireScraper session results showing all available download formats

The Formats at a Glance

| Format | What You Get | Best For |
|---|---|---|
| Markdown | One .md file per page | LLM context windows, fine-tuning data |
| JSONL | One JSON object per line | Embedding pipelines, streaming ingestion |
| CSV | Tabular data (URL, title, text, word count) | Spreadsheet analysis, database import |
| JSON | Single JSON array of all pages | Structured processing, APIs |
| ZIP | Bundle of individual text files | Archiving, offline processing |

Markdown: When Tokens Matter

Markdown is the most token-efficient format for passing scraped content to an LLM.

Raw HTML from a typical web page is often 60-80% markup: tags, classes, attributes, scripts, styles. None of that is useful to a language model. Markdown strips all of it and keeps the semantic structure: headings, paragraphs, lists, code blocks, links.

Use Markdown when:

- You are passing scraped content into an LLM context window (Claude, GPT-4, Gemini)
- You are building a fine-tuning dataset where every token costs money
- You want human-readable output that is also machine-readable

Example output:

````markdown
# Getting Started with Authentication

To authenticate API requests, include your API key in the Authorization header.

## Quick Start

1. Generate an API key in Settings > API Keys
2. Add the header to your requests:

```bash
curl -H "Authorization: Bearer your_key" https://api.example.com/data
```

## Rate Limits

Free tier: 100 requests/minute
Paid tier: 1,000 requests/minute
````

Compare that to the raw HTML of the same page, which might be 10x longer with <div>, <nav>, <footer>, <script>, and CSS class names consuming tokens that add zero value.
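
To get a rough sense of the difference on your own pages, a minimal sketch like the one below can compare the two. It assumes a crude heuristic of about four characters per token, which is only an approximation for English text. `estimateTokens` and the sample strings are hypothetical, and a real tokenizer for your model will give different numbers.

```javascript
// Rough token estimate: ~4 characters per token is a common rule of thumb
// for English text. This is an assumption, not a real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical sample: the same content as raw HTML and as Markdown.
const html =
  '<div class="docs-content"><nav class="sidebar"><a href="/">Home</a></nav>' +
  '<h1 class="title">Rate Limits</h1>' +
  '<p class="body-text">Free tier: 100 requests/minute</p>' +
  '<footer class="site-footer">Example Inc.</footer></div>';
const markdown = '# Rate Limits\n\nFree tier: 100 requests/minute\n';

const htmlTokens = estimateTokens(html);
const markdownTokens = estimateTokens(markdown);
const savings = 1 - markdownTokens / htmlTokens;

console.log(`HTML: ~${htmlTokens} tokens, Markdown: ~${markdownTokens} tokens`);
console.log(`Estimated savings: ${(savings * 100).toFixed(0)}%`);
```

Running the same comparison on a handful of real pages from your crawl is a quick way to check how much the Markdown export saves for your content.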

JSONL: When You Are Building an Embedding Pipeline

JSONL (JSON Lines) gives you one JSON object per line. Each object contains the page URL, title, extracted text, word count, and any structured extraction fields.

Use JSONL when:

- You are streaming data into an embedding pipeline (OpenAI, Cohere, Voyage)
- You need to process pages one at a time without loading everything into memory
- You are building a RAG system and need metadata alongside the text

Example output:

```json
{"url":"https://docs.example.com/auth","title":"Authentication","text":"To authenticate API requests...","wordCount":342}
{"url":"https://docs.example.com/rate-limits","title":"Rate Limits","text":"Free tier allows 100 requests...","wordCount":156}
```

The killer feature of JSONL: you can process it line by line. Read one line, parse it, embed it, store it. You never need to hold the entire dataset in memory. For a crawl with 10,000 pages, this matters.

Typical JSONL pipeline:

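```javascript
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

// chunkText, embed, and vectorDb are placeholders: supply your own chunker,
// embedding client, and vector store.
const rl = createInterface({
  input: createReadStream('results.jsonl'),
  crlfDelay: Infinity, // treat \r\n as a single line break
});

for await (const line of rl) {
  if (!line.trim()) continue; // skip blank lines
  const page = JSON.parse(line);

  // Chunk the text
  const chunks = chunkText(page.text, 1500);

  // Embed each chunk; the loop index keeps IDs unique even if two chunks are identical
  for (const [i, chunk] of chunks.entries()) {
    const embedding = await embed(chunk);
    await vectorDb.upsert({
      id: `${page.url}#${i}`,
      values: embedding,
      metadata: { url: page.url, title: page.title },
    });
  }
}
```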
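
The pipeline above calls `chunkText` without defining it. One possible shape, offered as a sketch rather than the author's implementation, is a character-based splitter that prefers to break on a space. It assumes the second argument is a maximum chunk length in characters; token-based or sentence-aware chunking may suit your embedding model better.

```javascript
// Hypothetical helper: split text into chunks of at most maxChars characters,
// breaking at the last space at or before the limit when one exists.
function chunkText(text, maxChars) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError('maxChars must be a positive integer');
  }
  const chunks = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    // Search backward from the limit for a space to break on.
    let cut = rest.lastIndexOf(' ', maxChars);
    if (cut <= 0) cut = maxChars; // no space found: hard split at the limit
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Breaking on spaces keeps words intact, at the cost of chunks that are slightly shorter than the limit.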

CSV: When You Need a Spreadsheet View

CSV is the most universal format. Excel, Google Sheets, pandas, and SQL databases can all read it, as can nearly anything else that handles tabular data.

Use CSV when:

- You want to browse crawl results in a spreadsheet
- You are loading data into a SQL database
- You need to share results with non-technical teammates
- You are doing exploratory analysis before deciding how to process the data

Example output:

```csv
url,title,text,wordCount,statusCode
https://docs.example.com/auth,Authentication,"To authenticate...",342,200
https://docs.example.com/rate-limits,Rate Limits,"Free tier...",156,200
```

CSV is also great for quick quality checks. Open the file, sort by word count, and immediately see which pages have too little content (possible scraping issues) or too much (might need chunking).
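
The same check can be scripted. The sketch below assumes the column names from the example above (`url`, `wordCount`) and uses made-up thresholds of 50 and 5,000 words. `parseCsv` is a minimal hand-rolled parser for illustration, and a dedicated CSV library is likely a better choice for production data.

```javascript
// Minimal CSV parser: handles quoted fields, escaped quotes (""), and
// commas or line breaks inside quotes. Illustrative only.
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // doubled quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n') {
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last row if the file does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Flag pages whose word count falls outside [minWords, maxWords].
// The default thresholds are assumptions; tune them for your content.
function flagPages(csvText, minWords = 50, maxWords = 5000) {
  const [header, ...rows] = parseCsv(csvText);
  if (!header) return [];
  const urlIdx = header.indexOf('url');
  const countIdx = header.indexOf('wordCount');
  const flagged = [];
  for (const row of rows) {
    const words = Number(row[countIdx]);
    if (words < minWords) {
      flagged.push({ url: row[urlIdx], words, issue: 'too short' });
    } else if (words > maxWords) {
      flagged.push({ url: row[urlIdx], words, issue: 'may need chunking' });
    }
  }
  return flagged;
}
```

Calling `flagPages(csvText)` on the contents of an exported file returns only the rows worth a second look, which is handy when the crawl is too large to eyeball in a spreadsheet.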

JSON: When You Need the Full Picture

JSON gives you a single array containing all pages. Unlike JSONL, it is a valid JSON document you can parse in one call.

Use JSON when:

- You are building an API that serves scraped data
- You want to load everything into memory and process it as a batch
- You need structured extraction results (nested objects, arrays)

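When to avoid JSON: if your crawl has thousands of pages, the JSON file can be large, and a standard parser has to load the whole array into memory before you can use any of it. JSONL is more memory-efficient for big datasets because you can read it one line at a time.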
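
For exports that do fit in memory, batch operations are short because the whole file is one array. As a sketch, assuming each page object carries the `title` and `wordCount` fields shown in the JSONL example (the JSON export's exact schema is not shown in this article, and `summarize` is a hypothetical helper):

```javascript
// Parse the entire export in one call, then work on it as an in-memory array.
function summarize(jsonText) {
  const pages = JSON.parse(jsonText);
  const totalWords = pages.reduce((sum, page) => sum + page.wordCount, 0);
  // Find the page with the highest word count (null for an empty export).
  const longest = pages.length
    ? pages.reduce((a, b) => (b.wordCount > a.wordCount ? b : a))
    : null;
  return {
    pageCount: pages.length,
    totalWords,
    longestTitle: longest ? longest.title : null,
  };
}
```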

ZIP: When You Want Individual Files

ZIP gives you one text file per page, bundled together. Each file is named after the page URL (sanitized).

Use ZIP when:

- You want to browse individual pages as files
- You are feeding pages into a system that expects one-file-per-document
- You want an archive for offline processing or backup
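
This article does not spell out how page URLs are sanitized into file names, so the function below is a hypothetical illustration of the general idea, not FireScraper's actual naming scheme: drop the protocol, replace characters that are unsafe in file names, and add an extension. It can be useful if you need to map a file in the archive back to a URL pattern, or name your own per-page files consistently.

```javascript
// Hypothetical sanitizer: turn a page URL into a file-system-safe name.
function urlToFileName(url, extension = '.txt') {
  const { hostname, pathname, search } = new URL(url);
  const raw = `${hostname}${pathname}${search}`;
  const safe = raw
    .replace(/[^a-zA-Z0-9._-]+/g, '_') // collapse each run of unsafe characters into one underscore
    .replace(/^_+|_+$/g, ''); // trim leading and trailing underscores
  return `${safe}${extension}`;
}
```

Note that a scheme like this is lossy: two different URLs can map to the same name, so a real implementation may need a hash or counter suffix to avoid collisions.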

Decision Flowchart

"I need to pass content to an LLM" → Markdown

"I am building a RAG / embedding pipeline" → JSONL

"I want to analyze or explore the data" → CSV

"I need it in a web API or batch processor" → JSON

"I want individual files per page" → ZIP

"I am not sure yet" → Start with JSONL. It is the most flexible — you can convert JSONL to any other format with a few lines of code, but you cannot easily go backward from CSV or Markdown to structured JSON.

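As a sketch of what converting from JSONL can look like, the helpers below turn JSONL text into an array (the shape of the JSON export) and then into CSV. The function names and the default column list are assumptions for illustration; the quoting follows the usual CSV convention of wrapping a field in double quotes and doubling any quotes inside it.

```javascript
// Parse JSONL text into an array of page objects, skipping blank lines.
function jsonlToPages(jsonlText) {
  return jsonlText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Quote a value for CSV if it contains a comma, a quote, or a line break.
function csvField(value) {
  const s = String(value ?? '');
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Serialize pages to CSV, using the given columns as the header row.
function pagesToCsv(pages, columns = ['url', 'title', 'text', 'wordCount']) {
  const lines = [columns.join(',')];
  for (const page of pages) {
    lines.push(columns.map((col) => csvField(page[col])).join(','));
  }
  return lines.join('\n') + '\n';
}
```

`JSON.stringify(jsonlToPages(text))` gives the single-array JSON form, and `pagesToCsv(jsonlToPages(text))` gives the tabular one. Nested extraction fields would need flattening before they fit into CSV columns.
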
Using Multiple Formats

You are not limited to one. FireScraper lets you download the same crawl in multiple formats. A common pattern:

- JSONL for your production embedding pipeline
- CSV for your data team to review quality
- Markdown for feeding specific pages into LLM prompts

You can also download partial results while a crawl is still running — useful for large crawls where you want to start processing before the entire site is scraped.
