
Overview

When you add a URL Source, Halo uses a web crawler to fetch and process pages. This page covers the crawler's technical details, which are useful if you need to allowlist it in your firewall, CDN, or bot protection.

Crawler identity

| Property | Value |
| --- | --- |
| User-Agent | HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler) |
| Referer | https://docs.haloagents.ai/knowledge/crawler (never Google or other organic search URLs) |
| Trace header | X-Halo-Crawler: knowledge-ingest |
| Respects robots.txt | Yes |
| JavaScript rendering | Yes (via Firecrawl for knowledge URL sources) |
| Firecrawl proxy | basic only (unless FIRECRAWL_ALLOW_SPOOFING_PROXY=true) |

Every outbound knowledge crawl uses the headers above on direct fetches and on every Firecrawl scrape / startCrawl call. We do not simulate Google organic traffic. Enhanced / stealth Firecrawl proxy modes are disabled by default because they can inject misleading browser fingerprints (including Referer: google.com).
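
To check an allowlist rule before adding a site, it can help to send a request that carries the same identifying headers. The following is a minimal sketch using only the Python standard library; the helper names `build_probe_request` and `probe` are hypothetical, and the header values are copied from the table above. It only approximates the crawler: bot protection that relies on IP ranges or browser fingerprints may treat your probe differently from real crawler traffic.

```python
import urllib.error
import urllib.request

# Header values copied from the crawler identity table above.
HALO_CRAWLER_HEADERS = {
    "User-Agent": "HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler)",
    "Referer": "https://docs.haloagents.ai/knowledge/crawler",
    "X-Halo-Crawler": "knowledge-ingest",
}


def build_probe_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the crawler's identifying headers."""
    return urllib.request.Request(url, headers=HALO_CRAWLER_HEADERS, method="GET")


def probe(url: str, timeout: float = 15.0) -> int:
    """Send the probe and return the HTTP status code (hypothetical helper)."""
    try:
        with urllib.request.urlopen(build_probe_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return err.code
```

Calling `probe("https://your-site.example/")` and getting 403 back would suggest the allowlist rule is not matching yet.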

Allowlisting the crawler

If your site uses bot protection (Cloudflare, Akamai, AWS WAF, Sucuri, etc.), you may need to allowlist the Halo crawler.

By User-Agent

Add HaloAgentsAI-Crawler to your bot protection allowlist. Steps depend on your provider.

Cloudflare
  1. Go to Security > WAF > Custom Rules
  2. Create a new rule with the expression:
    (http.user_agent contains "HaloAgentsAI-Crawler")
  3. Set the action to Skip (or Allow)

AWS WAF
  1. Go to your Web ACL > Rules
  2. Add a rule that matches the User-Agent header containing HaloAgentsAI-Crawler
  3. Set the action to Allow

Nginx
    # Clears the flag your bot-blocking rules check ($is_bot is an example variable name).
    if ($http_user_agent ~* "HaloAgentsAI-Crawler") {
      set $is_bot 0;
    }

Apache
    # Apache 2.2-style access control. As written, this denies every request
    # except the flagged crawler, so merge it with your existing access rules.
    SetEnvIf User-Agent "HaloAgentsAI-Crawler" allowed_bot
    Order Deny,Allow
    Deny from all
    Allow from env=allowed_bot

Via robots.txt

The crawler respects standard robots.txt rules. To explicitly allow Halo everywhere while keeping other bots out of /private/:

    User-agent: HaloAgentsAI-Crawler
    Allow: /

    User-agent: *
    Disallow: /private/

To block Halo from specific paths while allowing everything else:

    User-agent: HaloAgentsAI-Crawler
    Disallow: /admin/
    Disallow: /internal/
    Allow: /
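
Before deploying a robots.txt change, you can check how a standards-following parser reads it. The sketch below uses Python's built-in `urllib.robotparser` on the two rule sets above; the `allowed` helper and the example.com URLs are illustrative. Python's parser applies the first matching rule within a group, which may differ in edge cases from the parser the Halo crawler uses, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

CRAWLER_UA = "HaloAgentsAI-Crawler/1.0"

ALLOW_HALO = """\
User-agent: HaloAgentsAI-Crawler
Allow: /

User-agent: *
Disallow: /private/
"""

BLOCK_PATHS = """\
User-agent: HaloAgentsAI-Crawler
Disallow: /admin/
Disallow: /internal/
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(ALLOW_HALO, CRAWLER_UA, "https://example.com/private/page"))
    print(allowed(ALLOW_HALO, "SomeOtherBot", "https://example.com/private/page"))
    print(allowed(BLOCK_PATHS, CRAWLER_UA, "https://example.com/admin/users"))
```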

Crawler behavior

Discovery and incremental sync

When you add a site, Halo runs a one-time Firecrawl crawl to ingest everything. After that, a daily sitemap sync keeps the knowledge base in sync without re-crawling pages that haven’t changed:
  1. Fetch the site’s sitemap. Tried in order: the sitemap URL we last used, /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml, then Sitemap: directives in robots.txt. Sitemap index files are followed recursively.
  2. Diff the URL list against what we already have:
    • New URL in sitemap: insert a page row and scrape it.
    • <lastmod> newer than our last successful crawl: re-scrape and refresh the embeddings.
    • URL no longer in sitemap for 30+ days: soft-remove the page and delete its embeddings.
    • Everything else: do nothing. No HTTP request to the page at all.
  3. Update the page’s sitemap_lastmod and sitemap_last_seen_at so the next sync knows what’s actually changed.
This means a typical day generates exactly one sitemap fetch per site, plus Firecrawl scrapes only for pages whose publisher signalled a change.
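
The diff rules above can be sketched as a small planning function. This is an illustration, not Halo's implementation: `KnownPage`, `plan_sync`, and the action names are hypothetical, and treating a sitemap entry with no `<lastmod>` as unchanged is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

REMOVAL_GRACE = timedelta(days=30)  # "no longer in sitemap for 30+ days"


@dataclass
class KnownPage:
    last_crawled_at: datetime       # last successful crawl of the page
    sitemap_last_seen_at: datetime  # last sync in which the URL was in the sitemap


def plan_sync(
    sitemap: Dict[str, Optional[datetime]],  # URL -> <lastmod>, or None if absent
    known: Dict[str, KnownPage],
    now: datetime,
) -> Dict[str, str]:
    """Return one action per URL: 'scrape', 'rescrape', 'remove', or 'skip'."""
    actions: Dict[str, str] = {}
    for url, lastmod in sitemap.items():
        page = known.get(url)
        if page is None:
            actions[url] = "scrape"      # new URL in sitemap
        elif lastmod is not None and lastmod > page.last_crawled_at:
            actions[url] = "rescrape"    # publisher signalled a change
        else:
            actions[url] = "skip"        # assumption: no lastmod means unchanged
    for url, page in known.items():
        if url not in sitemap:
            missing_for = now - page.sitemap_last_seen_at
            actions[url] = "remove" if missing_for >= REMOVAL_GRACE else "skip"
    return actions
```

Only URLs mapped to `scrape` or `rescrape` would generate a page request; everything else costs nothing beyond the sitemap fetch.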

Fallback for sites without a sitemap

If we can’t find a working sitemap (no /sitemap.xml, no robots.txt directive, or the file returns nothing parseable), the site falls back to a quarterly full Firecrawl crawl so new pages still surface. Most documentation hosts (Mintlify, Docusaurus, Intercom, Zendesk, GitBook, Notion-published sites) publish sitemaps automatically, so the fallback rarely fires.
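
Whether the fallback fires depends on the sitemap lookup order described in the sync steps. A rough sketch of building that candidate list follows; `sitemap_candidates` is a hypothetical helper, and the de-duplication step is an assumption added for tidiness. Fetching and parsing each candidate is left out.

```python
from typing import List, Optional
from urllib.parse import urljoin

# Well-known locations tried after the last-used sitemap URL.
DEFAULT_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]


def sitemap_candidates(
    site_url: str,
    last_used: Optional[str] = None,
    robots_txt: str = "",
) -> List[str]:
    """List sitemap URLs to try, in the order described in the sync steps."""
    candidates: List[str] = []
    if last_used:
        candidates.append(last_used)
    candidates += [urljoin(site_url, path) for path in DEFAULT_SITEMAP_PATHS]
    # "Sitemap:" directives in robots.txt come last; the key is case-insensitive.
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    # Drop duplicates while keeping the first occurrence (assumed behavior).
    return list(dict.fromkeys(candidates))
```

If none of the candidates returns a parseable sitemap, the site would be a candidate for the quarterly full crawl.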

Request pattern

| Property | Value |
| --- | --- |
| Daily sitemap sync | One sitemap.xml fetch per site, at 06:00 UTC |
| Page scrapes | Only when sitemap <lastmod> is newer than our last crawl, or the URL is new |
| Full re-discovery fallback | Every 90 days for sites without a working sitemap |
| Request rate | Up to 200 requests/min per domain |
| Concurrent requests | Up to 10 per organization |
| Request timeout | 15 seconds per page |
| Protocols | HTTPS and HTTP |
| Content types | HTML only |

The crawler is designed to be respectful: rate-limited per domain, concurrency-capped per org, and driven by your sitemap so we never re-fetch content that hasn’t changed.
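
To make the per-domain limit concrete, here is one way a cap of 200 requests per minute could be enforced with a sliding window. This is a sketch of the general technique, not a description of how Halo enforces its limit; `DomainRateLimiter` is a hypothetical class, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque


class DomainRateLimiter:
    """Sliding-window limiter: at most `max_requests` per `window_seconds` per domain."""

    def __init__(
        self,
        max_requests: int = 200,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def try_acquire(self, domain: str) -> bool:
        """Record a request and return True if `domain` is under its limit."""
        now = self.clock()
        sent = self._sent[domain]
        # Forget requests that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_requests:
            return False
        sent.append(now)
        return True
```

A caller would wait and retry when `try_acquire` returns False. The per-organization concurrency cap is a separate control and could be modelled with a semaphore of size 10.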

What gets extracted

From each page:
  • Page text: clean body content with nav, headers, footers, and boilerplate removed
  • Title and description: from meta tags
  • Images: URLs, alt text, and surrounding context
  • Videos: YouTube and Vimeo embeds, with automatic transcript fetching when available

Change detection (defense in depth)

Even when the sitemap signals a change, we hash the normalized content after extraction. If the hash matches what we already have, we skip re-embedding so cosmetic edits (build dates, copyright years, whitespace) don’t burn embedding tokens.
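
The hash check can be illustrated as follows. The exact normalization Halo applies is not specified here, so the rules in `normalize` (dropping copyright-year notices and collapsing whitespace) and the choice of SHA-256 are assumptions made for the sake of a runnable example; all function names are hypothetical.

```python
import hashlib
import re

# Assumed pattern for copyright-year notices such as "© 2024" or "(c) 2019-2024".
COPYRIGHT_RE = re.compile(r"(?:©|\(c\)|copyright)\s*\d{4}(?:\s*-\s*\d{4})?", re.IGNORECASE)


def normalize(text: str) -> str:
    """Strip cosmetic noise so that only substantive edits change the hash."""
    text = COPYRIGHT_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """Hash the normalized text; SHA-256 is an assumed choice."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def needs_reembedding(new_text: str, stored_hash: str) -> bool:
    """True only when the normalized content differs from what was stored."""
    return content_hash(new_text) != stored_hash
```

With this scheme, a page whose only change is a footer year or reflowed whitespace produces the same hash and would skip the embedding step.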

Troubleshooting

If pages fail to crawl, check your server logs for the HaloAgentsAI-Crawler User-Agent. Common causes:
  • 403 Forbidden: Bot protection is blocking the crawler. See allowlisting above.
  • 429 Too Many Requests: Your server is rate-limiting the crawler. This usually resolves on retry.
  • 500 Server Error: A server-side issue on the target site.
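
For a quick look at which of these responses the crawler is receiving, a short script can tally status codes from an access log. The sketch below assumes the common or combined log format, where the status code follows the quoted request line; `crawler_status_counts` is a hypothetical helper, and other log formats would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# In common/combined log format the status code follows the quoted request line:
#   ... "GET /docs HTTP/1.1" 403 512 ...
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def crawler_status_counts(
    log_lines: Iterable[str],
    marker: str = "HaloAgentsAI-Crawler",
) -> Counter:
    """Count HTTP status codes on log lines that mention the crawler's User-Agent."""
    counts: Counter = Counter()
    for line in log_lines:
        if marker not in line:
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open log file, for example `crawler_status_counts(open("access.log"))`, returns a counter keyed by status code; a high 403 count points to bot protection, a high 429 count to rate limiting.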
If pages are missing from your knowledge base, check your sitemap. The crawler relies on sitemaps for discovery. Without a sitemap.xml, it falls back to following links from the homepage, which may miss deeper pages. Adding a sitemap is the best way to ensure full coverage.
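If a page's content comes back empty, check whether it depends on client-side JavaScript. JavaScript rendering runs through Firecrawl for knowledge URL sources (see Crawler identity); a single-page app fetched as raw HTML without rendering may appear as an empty shell. Server-side rendering or a static HTML version avoids the problem.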

Where to go next

URL Sources

Add a site to your knowledge base.

Knowledge Overview

The full knowledge ingestion pipeline.