Overview
When you add a URL Source, Halo uses a web crawler to fetch and process pages. This page covers the technical details, which are useful if you need to allowlist the crawler in your firewall, CDN, or bot protection.
Crawler identity
| Property | Value |
|---|---|
| User-Agent | HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler) |
| Referer | https://docs.haloagents.ai/knowledge/crawler (never Google or other organic search URLs) |
| Trace header | X-Halo-Crawler: knowledge-ingest |
| Respects robots.txt | Yes |
| JavaScript rendering | Yes (via Firecrawl for knowledge URL sources) |
| Firecrawl proxy | basic only (unless FIRECRAWL_ALLOW_SPOOFING_PROXY=true) |
These headers are set on every Firecrawl scrape / startCrawl call. We do not simulate Google organic traffic. Enhanced / stealth Firecrawl proxy modes are disabled by default because they can inject misleading browser fingerprints (including Referer: google.com).
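If you want to recognize the crawler in your own middleware or log pipeline, the identity values in the table above can be checked with something like the sketch below. The function name `is_halo_crawler` and the choice to require both the User-Agent token and the trace header are assumptions made for illustration; request headers can be spoofed, so treat a match as a hint rather than proof.

```python
HALO_UA_TOKEN = "HaloAgentsAI-Crawler"
HALO_TRACE_HEADER = "X-Halo-Crawler"
HALO_TRACE_VALUE = "knowledge-ingest"


def is_halo_crawler(headers: dict[str, str]) -> bool:
    """Return True if the request headers look like the Halo crawler.

    Hypothetical helper: requires both the User-Agent token and the
    trace header documented in the identity table.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    trace = normalized.get(HALO_TRACE_HEADER.lower(), "")
    return HALO_UA_TOKEN in user_agent and trace.strip() == HALO_TRACE_VALUE
```

Requiring both signals is stricter than most bot-protection rules, which usually match on User-Agent alone; loosen the check if your proxy strips custom headers.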
Allowlisting the crawler
If your site uses bot protection (Cloudflare, Akamai, AWS WAF, Sucuri, etc.), you may need to allowlist the Halo crawler.
By User-Agent
Add HaloAgentsAI-Crawler to your bot protection allowlist. Steps depend on your provider:
Cloudflare
- Go to Security > WAF > Custom Rules
- Create a new rule with an expression that matches a User-Agent containing HaloAgentsAI-Crawler
- Set the action to Skip (or Allow)
AWS WAF
- Go to your Web ACL > Rules
- Add a rule that matches the User-Agent header containing HaloAgentsAI-Crawler
- Set the action to Allow
Nginx
Apache
Via robots.txt
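The crawler respects standard robots.txt rules. To explicitly allow Halo while blocking other bots, give HaloAgentsAI-Crawler its own User-agent group with Allow: / and keep the restrictive rules in the User-agent: * group.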
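One way to sanity-check a robots.txt policy before deploying it is to run it through Python's standard `urllib.robotparser`. The policy text, the `allowed` helper, and the example.com URL below are illustrative assumptions; Python's parser is only one implementation of the robots rules, so the result is a reasonable approximation rather than a guarantee of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Example policy (an assumption for illustration): allow the Halo
# crawler everywhere, block every other bot.
ROBOTS_TXT = """\
User-agent: HaloAgentsAI-Crawler
Allow: /

User-agent: *
Disallow: /
"""

HALO_UA = "HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler)"


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this policy, `allowed(HALO_UA, "https://example.com/docs/intro")` is true, while the same call with a different bot's User-Agent is false.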
Crawler behavior
Discovery and incremental sync
When you add a site, Halo runs a one-time Firecrawl crawl to ingest everything. After that, a daily sitemap sync keeps the knowledge base in sync without re-crawling pages that haven’t changed:
- Fetch the site’s sitemap. Tried in order: the sitemap URL we last used, /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml, then Sitemap: directives in robots.txt. Sitemap index files are followed recursively.
- Diff the URL list against what we already have:
  - New URL in sitemap: insert a page row and scrape it.
  - <lastmod> newer than our last successful crawl: re-scrape and refresh the embeddings.
  - URL no longer in sitemap for 30+ days: soft-remove the page and delete its embeddings.
  - Everything else: do nothing. No HTTP request to the page at all.
- Update the page’s sitemap_lastmod and sitemap_last_seen_at so the next sync knows what’s actually changed.
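The diff step above can be sketched roughly as follows. This is a simplified model, not Halo's implementation: the `KnownPage` fields, the action names, and the `plan_sync` function are hypothetical, and the sitemap is reduced to a mapping from URL to an optional `<lastmod>` timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REMOVAL_GRACE = timedelta(days=30)  # "no longer in sitemap for 30+ days"


@dataclass
class KnownPage:
    last_crawled_at: datetime  # last successful crawl of the page
    last_seen_at: datetime     # last sync in which the sitemap listed the URL


def plan_sync(
    sitemap: dict[str, Optional[datetime]],
    known: dict[str, KnownPage],
    now: datetime,
) -> dict[str, str]:
    """Map each URL to one of: scrape, rescrape, soft_remove, skip."""
    actions: dict[str, str] = {}
    for url, lastmod in sitemap.items():
        page = known.get(url)
        if page is None:
            actions[url] = "scrape"        # new URL in sitemap
        elif lastmod is not None and lastmod > page.last_crawled_at:
            actions[url] = "rescrape"      # changed since our last crawl
        else:
            actions[url] = "skip"          # unchanged: no page request
    for url, page in known.items():
        if url in sitemap:
            continue
        if now - page.last_seen_at >= REMOVAL_GRACE:
            actions[url] = "soft_remove"   # missing for 30+ days
        else:
            actions[url] = "skip"          # missing, still in grace period
    return actions
```

Only the `scrape` and `rescrape` actions result in a request to your pages; everything else is decided from the sitemap alone.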
Fallback for sites without a sitemap
If we can’t find a working sitemap (no /sitemap.xml, no robots.txt directive, or the file returns nothing parseable), the site falls back to a quarterly full Firecrawl crawl so new pages still surface. Most documentation hosts (Mintlify, Docusaurus, Intercom, Zendesk, GitBook, Notion-published sites) publish sitemaps automatically, so the fallback rarely fires.
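To check whether your site would hit the fallback, you can approximate the sitemap lookup yourself. The sketch below builds the candidate list in the documented order and treats a response as unusable when it is not well-formed XML or contains no `<loc>` entries. The helper names are hypothetical, fetching is left out, and the real parser may be more lenient or stricter than this.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urljoin

DEFAULT_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]


def candidate_sitemaps(
    site_root: str, last_used: Optional[str], robots_txt: str
) -> list[str]:
    """Candidate sitemap URLs in the documented lookup order."""
    candidates: list[str] = []
    if last_used:
        candidates.append(last_used)
    candidates += [urljoin(site_root, path) for path in DEFAULT_PATHS]
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            candidates.append(value.strip())
    # Drop duplicates while keeping the first occurrence of each URL.
    return list(dict.fromkeys(candidates))


def parse_sitemap_locs(xml_text: str) -> Optional[list[str]]:
    """Return the <loc> values, or None if nothing parseable was found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    locs = []
    for element in root.iter():
        # Tags are namespace-qualified, e.g. "{...sitemap/0.9}loc".
        is_loc = element.tag == "loc" or element.tag.endswith("}loc")
        if is_loc and element.text and element.text.strip():
            locs.append(element.text.strip())
    return locs or None
```

If every candidate returns `None` from `parse_sitemap_locs`, the site is in the situation this section describes.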
Request pattern
| Property | Value |
|---|---|
| Daily sitemap sync | One sitemap.xml fetch per site, at 06:00 UTC |
| Page scrapes | Only when sitemap <lastmod> is newer than our last crawl, or the URL is new |
| Full re-discovery fallback | Every 90 days for sites without a working sitemap |
| Request rate | Up to 200 requests/min per domain |
| Concurrent requests | Up to 10 per organization |
| Request timeout | 15 seconds per page |
| Protocols | HTTPS and HTTP |
| Content types | HTML only |
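For capacity planning, the per-domain ceiling in the table translates into simple arithmetic. The helper names below are hypothetical, and the figures are upper bounds on averages: they assume the crawler runs at the full 200 requests/min, which a typical sync touching only new or changed pages will not do.

```python
import math

MAX_REQUESTS_PER_MINUTE = 200  # per-domain ceiling from the table above


def min_minutes_for_pages(
    page_count: int, per_minute: int = MAX_REQUESTS_PER_MINUTE
) -> int:
    """Shortest time, in whole minutes, to request page_count pages."""
    return math.ceil(page_count / per_minute)


def average_requests_per_second(
    per_minute: int = MAX_REQUESTS_PER_MINUTE,
) -> float:
    """Average request rate implied by the per-minute ceiling."""
    return per_minute / 60
```

For example, an initial crawl of 5,000 pages takes at least 25 minutes at the ceiling, which averages a little over 3 requests per second against your origin.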
What gets extracted
From each page:
- Page text: clean body content with nav, headers, footers, and boilerplate removed
- Title and description: from meta tags
- Images: URLs, alt text, and surrounding context
- Videos: YouTube and Vimeo embeds, with automatic transcript fetching when available
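To preview what the title and description extraction is likely to pick up from one of your pages, a minimal sketch using Python's standard `html.parser` is shown below. It assumes the title comes from the `<title>` element and the description from `<meta name="description">`; the actual extractor may also read other tags (for example Open Graph metadata), so treat this as an approximation.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attributes.get("name") or "").lower()
            if name == "description":
                self.description = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html: str) -> tuple[str, str]:
    """Return (title, description) for an HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description
```

An empty title or description from this check usually means the tag is missing from the served HTML.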
Change detection (defense in depth)
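A rough sketch of the hash check this section describes: normalize the extracted text, hash it, and compare against the stored hash before re-embedding. The normalization here only collapses whitespace, and SHA-256 is an assumed choice; the real normalizer evidently also neutralizes things like build dates and copyright years, which this example does not attempt.

```python
import hashlib
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout-only edits hash the same."""
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """Hex digest of the normalized text (SHA-256 is an assumption)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def needs_reembedding(new_text: str, stored_hash: Optional[str]) -> bool:
    """True when the page has no stored hash or its content changed."""
    return content_hash(new_text) != stored_hash
```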
Even when the sitemap signals a change, we hash the normalized content after extraction. If the hash matches what we already have, we skip re-embedding, so cosmetic edits (build dates, copyright years, whitespace) don’t burn embedding tokens.
Troubleshooting
Pages are showing as errors
Check your server logs for the HaloAgentsAI-Crawler User-Agent. Common causes:
- 403 Forbidden: Bot protection is blocking the crawler. See allowlisting above.
- 429 Too Many Requests: Your server is rate-limiting the crawler. This should resolve on retry.
- 500 Server Error: A server-side issue on the target site.
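To see which of these causes applies, you can tally the status codes your server returned to the crawler. The sketch below assumes access logs in the common "combined" format (request line, status, size, referer, user agent, each quoted where usual); the regular expression and the `crawler_status_counts` name are illustrative and will need adjusting for other log formats.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed combined log format:
#   ... "GET /path HTTP/1.1" 403 512 "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_status_counts(lines: Iterable[str]) -> Counter:
    """Count HTTP status codes for requests made by the Halo crawler."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "HaloAgentsAI-Crawler" in match.group("ua"):
            counts[int(match.group("status"))] += 1
    return counts
```

Pass it an open log file, for example `crawler_status_counts(open("access.log"))`, where the path is whatever your server uses. A count dominated by 403 points at bot protection, while 429 points at rate limiting.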
Not all pages are being discovered
The crawler relies on sitemaps for discovery. Without a sitemap.xml, the crawler falls back to following links from the homepage, which may miss deeper pages. Adding a sitemap is the best way to ensure full coverage.
Content looks wrong or incomplete
The crawler reads raw HTML and doesn’t execute JavaScript. If your site is a single-page app, the crawler may see an empty shell. Use server-side rendering or a static HTML version.
Where to go next
URL Sources
Add a site to your knowledge base.
Knowledge Overview
The full knowledge ingestion pipeline.