
Help-Center Crawler

Crawl any docs or FAQ site with Firecrawl, ingest every page as a raw source into your knowledge base, then query the compiled knowledge with natural-language questions and get grounded answers with source citations.

Prerequisites

  • A Senso API key (create one on the API Keys page)
  • A Firecrawl API key (get one at firecrawl.dev)
  • Python 3.10+
    export SENSO_API_KEY="YOUR_API_KEY"
    export FIRECRAWL_KEY="YOUR_FIRECRAWL_KEY"
    pip install requests firecrawl-py rich

    How it works

    1. Crawl — Firecrawl scrapes every page on the target site and returns Markdown (your raw sources)
    2. Ingest — For each page, call POST /org/kb/upload to get a presigned S3 URL, then PUT the raw source. Poll GET /org/content/{id} until all sources are compiled.
    3. Query — POST /org/search with natural-language questions. The response includes a grounded answer and a results array of source chunks with relevance scores.

    Full script

    import hashlib, os, sys, time, requests
    from firecrawl import FirecrawlApp
    
    SENSO_API_KEY = os.environ["SENSO_API_KEY"]
    FIRECRAWL_KEY = os.environ["FIRECRAWL_KEY"]
    BASE = "https://apiv2.senso.ai/api/v1"
    HEADERS = {"X-API-Key": SENSO_API_KEY, "Content-Type": "application/json"}
    
    # ── 1. Crawl ──────────────────────────────────────────────────
    
    url = sys.argv[1] if len(sys.argv) > 1 else "https://docs.example.com"
    print(f"Crawling {url}...")
    
    fc = FirecrawlApp(api_key=FIRECRAWL_KEY)
    # Dict-style response matches older firecrawl-py releases; newer versions
    # return a crawl-status object, so adjust accordingly.
    crawl = fc.crawl_url(url, params={"limit": 50, "scrapeOptions": {"formats": ["markdown"]}})
    pages = crawl.get("data", [])
    print(f"Found {len(pages)} pages")
    
    # ── 2. Ingest ─────────────────────────────────────────────────
    
    content_ids = []
    
    for page in pages:
        md = page.get("markdown", "")
        if not md.strip():
            continue
    
        source_url = page.get("metadata", {}).get("sourceURL", "unknown")
        filename = source_url.rstrip("/").split("/")[-1] or "index"
        filename = f"{filename}.md"
        file_bytes = md.encode("utf-8")
    
        # Request presigned upload URL
        resp = requests.post(f"{BASE}/org/kb/upload", headers=HEADERS, json={
            "files": [{
                "filename": filename,
                "file_size_bytes": len(file_bytes),
                "content_type": "text/markdown",
                "content_hash_md5": hashlib.md5(file_bytes).hexdigest(),
            }]
        })
    
        if resp.status_code != 200:
            print(f"  Skip {filename}: {resp.status_code}")
            continue
    
        result = resp.json()["results"][0]
        if result["status"] != "upload_pending":
            print(f"  Skip {filename}: {result['status']}")
            continue
    
        # Upload to S3
        requests.put(result["upload_url"], data=file_bytes)
        content_ids.append(result["content_id"])
        print(f"  Uploaded {filename} -> {result['content_id']}")
    
    print(f"\nIngested {len(content_ids)} pages. Waiting for processing...")
    
    # Poll until all content is processed
    for cid in content_ids:
        while True:
            item = requests.get(
                f"{BASE}/org/content/{cid}",
                headers={"X-API-Key": SENSO_API_KEY},
            ).json()
            if item.get("processing_status") == "complete":
                break
            time.sleep(1)
    
    print("All content processed.\n")
    
    # ── 3. Query ──────────────────────────────────────────────────
    
    print("Ask questions (Ctrl+C to quit):\n")
    while True:
        try:
            query = input("Q: ").strip()
        except (KeyboardInterrupt, EOFError):
            print()
            break
        if not query:
            continue
    
        resp = requests.post(f"{BASE}/org/search", headers=HEADERS, json={
            "query": query,
            "max_results": 5,
        })
        data = resp.json()
    
        print(f"\nA: {data.get('answer', 'No answer generated.')}\n")
        for r in data.get("results", []):
            print(f"  [{r['score']:.2f}] {r['title']}")
            print(f"         {r['chunk_text'][:120]}...")
        print()
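
The polling loop in the script blocks forever if a page never finishes processing. A small variant with a timeout is sketched below; the status-fetching call is injected so it can wrap the script's GET /org/content/{id} request, and the "failed" terminal status is an assumption, so check the API reference for the exact status values.

```python
import time

def wait_for_processing(fetch_status, content_id, timeout=300, interval=1.0):
    """Poll fetch_status(content_id) until it reports "complete".

    fetch_status is any callable that returns the processing_status
    string, e.g. a wrapper around GET /org/content/{id}. Raises
    TimeoutError if the deadline passes, RuntimeError on failure.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(content_id)
        if status == "complete":
            return
        if status == "failed":  # assumed terminal status; verify in the API docs
            raise RuntimeError(f"content {content_id} failed processing")
        time.sleep(interval)
    raise TimeoutError(f"content {content_id} not processed within {timeout}s")
```

In the script, fetch_status would be a lambda that issues the GET request with the X-API-Key header and returns `resp.json().get("processing_status")`.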

    Run it

    python help_center_crawler.py https://docs.example.com/faq

    Output:

    Crawling https://docs.example.com/faq...
    Found 23 pages
      Uploaded getting-started.md -> a1b2c3d4-...
      Uploaded pricing.md -> e5f6a7b8-...
      ...
    All content processed.

    Ask questions (Ctrl+C to quit):

    Q: How do I reset my password?

    A: To reset your password, go to Settings > Security and click "Reset password". You'll receive an email with a reset link valid for 24 hours.

      [0.95] account-security.md
             To reset your password, navigate to Settings > Security. Click the "Reset...
      [0.82] getting-started.md
             During first login, you'll be prompted to set a password. If you forget it...

    Key API details

    Ingestion request — each file needs four fields:

    Field               Type      Description
    filename            string    Original filename
    file_size_bytes     integer   Must be >= 1
    content_type        string    MIME type (e.g. text/markdown, application/pdf)
    content_hash_md5    string    32-character hex MD5 digest
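
All four fields can be derived from the raw file bytes; a minimal helper (the function name is ours, not part of any SDK) might look like:

```python
import hashlib

def file_descriptor(filename, data, content_type="text/markdown"):
    """Build one entry for the `files` array of POST /org/kb/upload."""
    return {
        "filename": filename,
        "file_size_bytes": len(data),  # must be >= 1
        "content_type": content_type,
        "content_hash_md5": hashlib.md5(data).hexdigest(),  # 32-char hex digest
    }

desc = file_descriptor("faq.md", b"# FAQ\n")
```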
    Query response — the fields you'll use most:

    Field                   Description
    answer                  Grounded AI-generated answer from your compiled knowledge base
    results[].chunk_text    The actual text from the matching chunk
    results[].score         Relevance score (0-1)
    results[].title         Source document title
    results[].content_id    UUID of the source content item
    total_results           Total matches found (before max_results cap)
    processing_time_ms      How long the search took
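
To see how those fields fit together, here is a sketch that formats a search response the way the script does; the sample dict is illustrative, not a real API payload.

```python
def summarize(resp):
    """Render a /org/search response using the fields listed above."""
    lines = [f"A: {resp.get('answer', 'No answer generated.')}"]
    for r in resp.get("results", []):
        lines.append(f"  [{r['score']:.2f}] {r['title']}")
        lines.append(f"         {r['chunk_text'][:120]}...")
    lines.append(f"{resp['total_results']} matches in {resp['processing_time_ms']} ms")
    return "\n".join(lines)

# Illustrative payload: field names match the table, values are made up.
sample = {
    "answer": "Go to Settings > Security and click Reset password.",
    "results": [{
        "score": 0.95,
        "title": "account-security.md",
        "chunk_text": "To reset your password, navigate to Settings > Security.",
        "content_id": "a1b2c3d4-0000-0000-0000-000000000000",
    }],
    "total_results": 7,
    "processing_time_ms": 120,
}

print(summarize(sample))
```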