# Help-Center Crawler
Crawl any docs or FAQ site with Firecrawl, ingest every page as a raw source into your knowledge base, then query the compiled knowledge with natural-language questions and get grounded answers with source citations.
## Prerequisites
```shell
export SENSO_API_KEY="YOUR_API_KEY"
export FIRECRAWL_KEY="YOUR_FIRECRAWL_KEY"

pip install requests firecrawl-py rich
```

## How it works
1. Crawl — Firecrawl scrapes every page on the target site and returns Markdown (your raw sources).
2. Ingest — For each page, call POST /org/kb/upload to get a presigned S3 URL, then PUT the raw source. Poll GET /org/content/{id} until all sources are compiled.
3. Query — POST /org/search with natural-language questions. The response includes a grounded answer and the results array of source chunks with relevance scores.
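Step 2 needs a filename for each page; the script below derives one from the page's `sourceURL`. A minimal sketch of that logic (the `filename_for` helper is illustrative, not part of any SDK):

```python
def filename_for(source_url: str) -> str:
    """Derive a Markdown filename from the last path segment of a page URL."""
    name = source_url.rstrip("/").split("/")[-1] or "index"
    return f"{name}.md"
```

Pages whose URL has no usable last segment fall back to `index.md`, so every uploaded source gets a non-empty filename.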
## Full script
```python
import hashlib, os, sys, time, requests
from firecrawl import FirecrawlApp

SENSO_API_KEY = os.environ["SENSO_API_KEY"]
FIRECRAWL_KEY = os.environ["FIRECRAWL_KEY"]
BASE = "https://apiv2.senso.ai/api/v1"
HEADERS = {"X-API-Key": SENSO_API_KEY, "Content-Type": "application/json"}

# ── 1. Crawl ──────────────────────────────────────────────────
url = sys.argv[1] if len(sys.argv) > 1 else "https://docs.example.com"
print(f"Crawling {url}...")
fc = FirecrawlApp(api_key=FIRECRAWL_KEY)
crawl = fc.crawl_url(url, params={"limit": 50, "scrapeOptions": {"formats": ["markdown"]}})
pages = crawl.get("data", [])
print(f"Found {len(pages)} pages")

# ── 2. Ingest ─────────────────────────────────────────────────
content_ids = []
for page in pages:
    md = page.get("markdown", "")
    if not md.strip():
        continue
    source_url = page.get("metadata", {}).get("sourceURL", "unknown")
    filename = source_url.rstrip("/").split("/")[-1] or "index"
    filename = f"{filename}.md"
    file_bytes = md.encode("utf-8")

    # Request a presigned upload URL
    resp = requests.post(f"{BASE}/org/kb/upload", headers=HEADERS, json={
        "files": [{
            "filename": filename,
            "file_size_bytes": len(file_bytes),
            "content_type": "text/markdown",
            "content_hash_md5": hashlib.md5(file_bytes).hexdigest(),
        }]
    })
    if resp.status_code != 200:
        print(f"  Skip {filename}: {resp.status_code}")
        continue
    result = resp.json()["results"][0]
    if result["status"] != "upload_pending":
        print(f"  Skip {filename}: {result['status']}")
        continue

    # Upload the raw source to S3
    requests.put(result["upload_url"], data=file_bytes)
    content_ids.append(result["content_id"])
    print(f"  Uploaded {filename} -> {result['content_id']}")

print(f"\nIngested {len(content_ids)} pages. Waiting for processing...")

# Poll until all content is processed
for cid in content_ids:
    while True:
        item = requests.get(
            f"{BASE}/org/content/{cid}",
            headers={"X-API-Key": SENSO_API_KEY},
        ).json()
        if item.get("processing_status") == "complete":
            break
        time.sleep(1)
print("All content processed.\n")

# ── 3. Query ──────────────────────────────────────────────────
print("Ask questions (Ctrl+C to quit):\n")
while True:
    try:
        query = input("Q: ").strip()
    except (KeyboardInterrupt, EOFError):
        print()
        break
    if not query:
        continue
    resp = requests.post(f"{BASE}/org/search", headers=HEADERS, json={
        "query": query,
        "max_results": 5,
    })
    data = resp.json()
    print(f"\nA: {data.get('answer', 'No answer generated.')}\n")
    for r in data.get("results", []):
        print(f"  [{r['score']:.2f}] {r['title']}")
        print(f"    {r['chunk_text'][:120]}...")
    print()
```

## Run it
```shell
python help_center_crawler.py https://docs.example.com/faq
```

Output:

```
Crawling https://docs.example.com/faq...
Found 23 pages
  Uploaded getting-started.md -> a1b2c3d4-...
  Uploaded pricing.md -> e5f6a7b8-...
  ...
All content processed.

Ask questions (Ctrl+C to quit):

Q: How do I reset my password?

A: To reset your password, go to Settings > Security and click "Reset password".
You'll receive an email with a reset link valid for 24 hours.

  [0.95] account-security.md
    To reset your password, navigate to Settings > Security. Click the "Reset...
  [0.82] getting-started.md
    During first login, you'll be prompted to set a password. If you forget it...
```
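The script above polls each content item indefinitely; in practice you may want a bound on how long to wait. A hedged variant with a timeout (the `wait_until_complete` helper and its `fetch_status` callable are illustrative, not part of the API):

```python
import time

def wait_until_complete(fetch_status, timeout: float = 120.0,
                        interval: float = 1.0) -> bool:
    """Call fetch_status() until it returns "complete" or the timeout lapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_status() == "complete":
            return True
        time.sleep(interval)
    return False
```

In the script, `fetch_status` would be a closure over `GET /org/content/{id}` that returns the item's `processing_status`, letting you skip or retry sources that never finish.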
## Key API details
Ingestion request — each file needs four fields:

| Field | Type | Description |
|---|---|---|
| `filename` | string | Original filename |
| `file_size_bytes` | integer | Must be >= 1 |
| `content_type` | string | MIME type (e.g. `text/markdown`, `application/pdf`) |
| `content_hash_md5` | string | 32-character hex MD5 digest |
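All four fields can be computed from the raw bytes with the standard library. A small sketch (`upload_manifest` is an illustrative helper, not an SDK function):

```python
import hashlib

def upload_manifest(filename: str, file_bytes: bytes,
                    content_type: str = "text/markdown") -> dict:
    """Build one entry for the `files` array of POST /org/kb/upload."""
    return {
        "filename": filename,
        "file_size_bytes": len(file_bytes),
        "content_type": content_type,
        "content_hash_md5": hashlib.md5(file_bytes).hexdigest(),
    }
```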
Search response fields:

| Field | Description |
|---|---|
| `answer` | Grounded AI-generated answer from your compiled knowledge base |
| `results[].chunk_text` | The actual text from the matching chunk |
| `results[].score` | Relevance score (0-1) |
| `results[].title` | Source document title |
| `results[].content_id` | UUID of the source content item |
| `total_results` | Total matches found (before the `max_results` cap) |
| `processing_time_ms` | How long the search took |
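These fields can be rendered the same way the script's query loop does. A sketch against a hand-made example payload (not a live API response):

```python
def format_answer(data: dict, snippet: int = 120) -> str:
    """Render the answer plus scored source chunks, mirroring the script."""
    lines = [f"A: {data.get('answer', 'No answer generated.')}"]
    for r in data.get("results", []):
        lines.append(f"  [{r['score']:.2f}] {r['title']}")
        lines.append(f"    {r['chunk_text'][:snippet]}...")
    return "\n".join(lines)

sample = {
    "answer": "Reset via Settings > Security.",
    "results": [{"score": 0.95, "title": "account-security.md",
                 "chunk_text": "To reset your password..."}],
}
print(format_answer(sample))
```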
