
Help-Center Crawler

Crawl any docs or FAQ site with Firecrawl, ingest every page as a raw source into your knowledge base, then query the compiled knowledge with natural-language questions and get grounded answers with source citations.

Prerequisites

  • A Senso API key (create one on the API Keys page)
  • A Firecrawl API key (get one at firecrawl.dev)
  • Python 3.10+
    export SENSO_API_KEY="YOUR_API_KEY"
    export FIRECRAWL_KEY="YOUR_FIRECRAWL_KEY"
    pip install requests firecrawl-py rich

    How it works

    1. Crawl — Firecrawl scrapes every page on the target site and returns Markdown (your raw sources)
    2. Ingest — For each page, call POST /org/kb/upload to get a presigned S3 URL, then PUT the raw source. Poll GET /org/content/{id} until all sources are compiled.
    3. Query — POST /org/search with natural-language questions. The response includes a grounded answer and a results array of source chunks with relevance scores.

    Full script

    import hashlib, os, sys, time, requests
    from firecrawl import FirecrawlApp
    
    SENSO_API_KEY = os.environ["SENSO_API_KEY"]
    FIRECRAWL_KEY = os.environ["FIRECRAWL_KEY"]
    BASE = "https://apiv2.senso.ai/api/v1"
    HEADERS = {"X-API-Key": SENSO_API_KEY, "Content-Type": "application/json"}
    
    # ── 1. Crawl ──────────────────────────────────────────────────
    
    url = sys.argv[1] if len(sys.argv) > 1 else "https://docs.example.com"
    print(f"Crawling {url}...")
    
    fc = FirecrawlApp(api_key=FIRECRAWL_KEY)
    # Dict-style response matches older firecrawl-py releases; newer versions
    # return a crawl-status object, so adjust accordingly.
    crawl = fc.crawl_url(url, params={"limit": 50, "scrapeOptions": {"formats": ["markdown"]}})
    pages = crawl.get("data", [])
    print(f"Found {len(pages)} pages")
    
    # ── 2. Ingest ─────────────────────────────────────────────────
    
    content_ids = []
    
    for page in pages:
        md = page.get("markdown", "")
        if not md.strip():
            continue
    
        source_url = page.get("metadata", {}).get("sourceURL", "unknown")
        filename = source_url.rstrip("/").split("/")[-1] or "index"
        filename = f"{filename}.md"
        file_bytes = md.encode("utf-8")
    
        # Request presigned upload URL
        resp = requests.post(f"{BASE}/org/kb/upload", headers=HEADERS, json={
            "files": [{
                "filename": filename,
                "file_size_bytes": len(file_bytes),
                "content_type": "text/markdown",
                "content_hash_md5": hashlib.md5(file_bytes).hexdigest(),
            }]
        })
    
        if resp.status_code != 200:
            print(f"  Skip {filename}: {resp.status_code}")
            continue
    
        result = resp.json()["results"][0]
        if result["status"] != "upload_pending":
            print(f"  Skip {filename}: {result['status']}")
            continue
    
        # Upload to S3
        requests.put(result["upload_url"], data=file_bytes)
        content_ids.append(result["content_id"])
        print(f"  Uploaded {filename} -> {result['content_id']}")
    
    print(f"\nIngested {len(content_ids)} pages. Waiting for processing...")
    
    # Poll until all content is processed
    for cid in content_ids:
        while True:
            item = requests.get(
                f"{BASE}/org/content/{cid}",
                headers={"X-API-Key": SENSO_API_KEY},
            ).json()
            if item.get("processing_status") == "complete":
                break
            time.sleep(1)
    
    print("All content processed.\n")
    
    # ── 3. Query ──────────────────────────────────────────────────
    
    print("Ask questions (Ctrl+C to quit):\n")
    while True:
        try:
            query = input("Q: ").strip()
        except (KeyboardInterrupt, EOFError):
            print()
            break
        if not query:
            continue
    
        resp = requests.post(f"{BASE}/org/search", headers=HEADERS, json={
            "query": query,
            "max_results": 5,
        })
        data = resp.json()
    
        print(f"\nA: {data.get('answer', 'No answer generated.')}\n")
        for r in data.get("results", []):
            print(f"  [{r['score']:.2f}] {r['title']}")
            print(f"         {r['chunk_text'][:120]}...")
        print()
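
The polling loop in the script blocks forever if a page never finishes processing. A small variant with a timeout is sketched below; the status-fetching call is injected so it can wrap the script's GET /org/content/{id} request, and the "failed" terminal status is an assumption, so check the API reference for the exact status values.

```python
import time

def wait_for_processing(fetch_status, content_id, timeout=300, interval=1.0):
    """Poll fetch_status(content_id) until it reports "complete".

    fetch_status is any callable that returns the processing_status
    string, e.g. a wrapper around GET /org/content/{id}. Raises
    TimeoutError if the deadline passes, RuntimeError on failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(content_id)
        if status == "complete":
            return
        if status == "failed":  # assumed terminal status; verify in the API docs
            raise RuntimeError(f"content {content_id} failed processing")
        time.sleep(interval)
    raise TimeoutError(f"content {content_id} not processed within {timeout}s")
```

In the script, fetch_status would be a lambda that issues the GET request with the X-API-Key header and returns `resp.json().get("processing_status")`.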

    Run it

    python help_center_crawler.py https://docs.example.com/faq

    Output:

    Crawling https://docs.example.com/faq...
    Found 23 pages
      Uploaded getting-started.md -> a1b2c3d4-...
      Uploaded pricing.md -> e5f6a7b8-...
      ...
    All content processed.

    Ask questions (Ctrl+C to quit):

    Q: How do I reset my password?

    A: To reset your password, go to Settings > Security and click "Reset password". You'll receive an email with a reset link valid for 24 hours.

      [0.95] account-security.md
             To reset your password, navigate to Settings > Security. Click the "Reset...
      [0.82] getting-started.md
             During first login, you'll be prompted to set a password. If you forget it...

    Key API details

    Ingestion request — each file needs four fields:

    Field               Type      Description
    filename            string    Original filename
    file_size_bytes     integer   Must be >= 1
    content_type        string    MIME type (e.g. text/markdown, application/pdf)
    content_hash_md5    string    32-character hex MD5 digest
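
All four fields can be derived from the raw file bytes; a minimal helper (the function name is ours, not part of any SDK) might look like:

```python
import hashlib

def file_descriptor(filename, data, content_type="text/markdown"):
    """Build one entry for the `files` array of POST /org/kb/upload."""
    return {
        "filename": filename,
        "file_size_bytes": len(data),  # must be >= 1
        "content_type": content_type,
        "content_hash_md5": hashlib.md5(data).hexdigest(),  # 32-char hex digest
    }

desc = file_descriptor("faq.md", b"# FAQ\n")
```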
    Query response — the fields you'll use most:

    Field                   Description
    answer                  Grounded AI-generated answer from your compiled knowledge base
    results[].chunk_text    The actual text from the matching chunk
    results[].score         Relevance score (0-1)
    results[].title         Source document title
    results[].content_id    UUID of the source content item
    total_results           Total matches found (before max_results cap)
    processing_time_ms      How long the search took
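
To see how those fields fit together, here is a sketch that formats a search response the way the script does; the sample dict is illustrative, not a real API payload.

```python
def summarize(resp):
    """Render a /org/search response using the fields listed above."""
    lines = [f"A: {resp.get('answer', 'No answer generated.')}"]
    for r in resp.get("results", []):
        lines.append(f"  [{r['score']:.2f}] {r['title']}")
        lines.append(f"         {r['chunk_text'][:120]}...")
    lines.append(f"{resp['total_results']} matches in {resp['processing_time_ms']} ms")
    return "\n".join(lines)

# Illustrative payload: field names match the table, values are made up.
sample = {
    "answer": "Go to Settings > Security and click Reset password.",
    "results": [{
        "score": 0.95,
        "title": "account-security.md",
        "chunk_text": "To reset your password, navigate to Settings > Security.",
        "content_id": "a1b2c3d4-0000-0000-0000-000000000000",
    }],
    "total_results": 7,
    "processing_time_ms": 120,
}

print(summarize(sample))
```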