Web Crawler

Overview

When you add a URL Source, Halo uses a web crawler to fetch and process pages. This page covers the technical details, which are useful if you need to allowlist the crawler in your firewall, CDN, or bot protection.

Crawler identity

Property	Value
User-Agent	`HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler)`
Referer	`https://docs.haloagents.ai/knowledge/crawler` (never Google or other organic search URLs)
Trace header	`X-Halo-Crawler: knowledge-ingest`
Respects robots.txt	Yes
JavaScript rendering	No. Pages are fetched as raw HTML, so page scripts never execute
Analytics impact	None. Because no JavaScript runs, crawls never fire Google Analytics, Tag Manager, or other tracking pixels on your site
Firecrawl proxy	`basic` only (unless `FIRECRAWL_ALLOW_SPOOFING_PROXY=true`)

Every outbound knowledge crawl sends the headers above, both on direct fetches and on every Firecrawl `scrape` / `startCrawl` call. We do not simulate Google organic traffic. Enhanced / stealth Firecrawl proxy modes are disabled by default because they can inject misleading browser fingerprints (including `Referer: google.com`). JavaScript rendering is also disabled by default (opt-in only, via `FIRECRAWL_ALLOW_JS_RENDERING`), so a crawl with default settings cannot register fake pageviews in a site’s analytics.
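
If you want to recognize the crawler in your own middleware or log pipeline, the identity values above can be checked with a few lines of code. This is a minimal sketch, not part of any Halo SDK: the function name `is_halo_crawler` is hypothetical, and it assumes your framework hands you request headers as a plain dictionary. Headers can be spoofed by any client, so treat a match as a hint rather than proof.

```python
CRAWLER_UA_TOKEN = "HaloAgentsAI-Crawler"
TRACE_HEADER = "x-halo-crawler"
TRACE_VALUE = "knowledge-ingest"


def is_halo_crawler(headers: dict[str, str]) -> bool:
    """Return True when the headers match the documented crawler identity."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    has_user_agent = CRAWLER_UA_TOKEN in lowered.get("user-agent", "")
    has_trace = lowered.get(TRACE_HEADER, "").strip() == TRACE_VALUE
    return has_user_agent and has_trace


example_headers = {
    "User-Agent": "HaloAgentsAI-Crawler/1.0 (+https://docs.haloagents.ai/knowledge/crawler)",
    "X-Halo-Crawler": "knowledge-ingest",
}
print(is_halo_crawler(example_headers))
```

Requiring both the User-Agent token and the trace header is a design choice for this sketch; matching on the User-Agent alone is what the allowlisting rules below do.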

Allowlisting the crawler

If your site uses bot protection (Cloudflare, Akamai, AWS WAF, Sucuri, etc.), you may need to allowlist the Halo crawler.

By User-Agent

Add HaloAgentsAI-Crawler to your bot protection allowlist. Steps depend on your provider:

Cloudflare

Go to Security > WAF > Custom Rules

Create a new rule with the expression:

(http.user_agent contains "HaloAgentsAI-Crawler")

Set the action to Skip (or Allow)

AWS WAF

Go to your Web ACL > Rules
Add a rule that matches the User-Agent header containing HaloAgentsAI-Crawler
Set the action to Allow

Nginx

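# Assumes your existing bot-blocking rules key off an $is_bot variable; this clears it for the Halo crawler.
if ($http_user_agent ~* "HaloAgentsAI-Crawler") {
  set $is_bot 0;
}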

Apache

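SetEnvIf User-Agent "HaloAgentsAI-Crawler" allowed_bot
# Apache 2.2 syntax. On Apache 2.4+, replace the three lines below with: Require env allowed_bot
# "Deny from all" blocks every other client, so apply this only inside the block that restricts bots.
Order Deny,Allow
Deny from all
Allow from env=allowed_bot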

Via robots.txt

The crawler respects standard robots.txt rules. To explicitly allow Halo everywhere while keeping other bots out of /private/:

User-agent: HaloAgentsAI-Crawler
Allow: /

User-agent: *
Disallow: /private/

To block specific paths:

User-agent: HaloAgentsAI-Crawler
Disallow: /admin/
Disallow: /internal/
Allow: /
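
Before deploying a robots.txt change, you can sanity-check the rules locally. The sketch below uses Python's standard `urllib.robotparser`; the rule set combines the two examples above, and `example.com` plus the helper name `allowed` are placeholders. It shows how a standards-following parser reads the rules, not how Halo's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules combining the two examples above.
ROBOTS_TXT = """\
User-agent: HaloAgentsAI-Crawler
Disallow: /admin/
Disallow: /internal/
Allow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/docs/intro", "/admin/users", "/internal/notes"):
    url = "https://example.com" + path
    print(path, allowed("HaloAgentsAI-Crawler", url))
```

Python's parser applies the first matching rule in a group, so the `Disallow` lines are listed before `Allow: /` here. Parsers that use longest-match precedence reach the same result for these rules.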

Crawler behavior

Discovery and incremental sync

When you add a site, Halo runs a one-time Firecrawl crawl to ingest everything. After that, a sitemap sync every two weeks keeps the knowledge base current without re-crawling pages that haven’t changed. Need fresher content after publishing? Use the Resync button on the URL source, which runs the same diff on demand. Each sync works as follows:

Fetch the site’s sitemap. Tried in order: the sitemap URL we last used, /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml, then Sitemap: directives in robots.txt. Sitemap index files are followed recursively.
Diff the URL list against what we already have:
- New URL in sitemap: insert a page row and scrape it.
- <lastmod> newer than our last successful crawl: re-scrape and refresh the embeddings.
- URL no longer in sitemap for 30+ days: soft-remove the page and delete its embeddings.
- Everything else: do nothing. No HTTP request to the page at all.
Update the page’s sitemap_lastmod and sitemap_last_seen_at so the next sync knows what’s actually changed.
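
The diff rules above can be sketched as a pure function that maps each URL to one action. This is an illustration under stated assumptions, not Halo's implementation: the `KnownPage` record, its field names, and the action labels are hypothetical, and a sitemap entry without `<lastmod>` is treated as unchanged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REMOVAL_GRACE = timedelta(days=30)  # "no longer in sitemap for 30+ days"


@dataclass
class KnownPage:
    """Hypothetical stored state for one page; field names are illustrative."""
    last_crawled_at: datetime       # last successful crawl
    sitemap_last_seen_at: datetime  # last sync in which the URL was listed


def plan_sync(
    sitemap: dict[str, Optional[datetime]],  # URL -> <lastmod>, None if absent
    known: dict[str, KnownPage],
    now: datetime,
) -> dict[str, str]:
    """Map every URL to one action: insert, rescrape, soft_remove, or skip."""
    actions: dict[str, str] = {}
    for url, lastmod in sitemap.items():
        page = known.get(url)
        if page is None:
            actions[url] = "insert"    # new URL: add a row and scrape it
        elif lastmod is not None and lastmod > page.last_crawled_at:
            actions[url] = "rescrape"  # publisher signalled a change
        else:
            actions[url] = "skip"      # no HTTP request to the page
    for url, page in known.items():
        if url in sitemap:
            continue
        # Only remove pages that have been missing for the full grace period.
        overdue = now - page.sitemap_last_seen_at >= REMOVAL_GRACE
        actions[url] = "soft_remove" if overdue else "skip"
    return actions
```

Keeping the planning step free of I/O makes the rules easy to test: the caller fetches the sitemap, calls `plan_sync`, then performs only the scrapes and removals the plan lists.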

This means a typical two-week window generates a single scheduled sitemap fetch per site (plus child sitemaps if the site uses a sitemap index), and Firecrawl scrapes only for pages whose publisher signalled a change.

Fallback for sites without a sitemap

If we can’t find a working sitemap (no /sitemap.xml, no robots.txt directive, or the file returns nothing parseable), the site falls back to a quarterly full Firecrawl crawl so new pages still surface. Most documentation hosts (Mintlify, Docusaurus, Intercom, Zendesk, GitBook, Notion-published sites) publish sitemaps automatically, so the fallback rarely fires.
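
To check whether your site would hit this fallback, you can walk the same lookup order yourself. The sketch below builds the candidate list in the order described under Discovery and incremental sync, then returns the first candidate that parses as a sitemap with at least one `<loc>` entry. All function names are hypothetical, `fetch` is a caller-supplied function that returns the response body or `None`, and sitemap index recursion is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Optional
from urllib.parse import urljoin

DEFAULT_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml")
SITEMAP_DIRECTIVE = re.compile(r"^\s*sitemap:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def candidate_sitemaps(base_url: str, last_used: Optional[str], robots_txt: str) -> list[str]:
    """List sitemap URLs to try, in order, without duplicates."""
    candidates = [last_used] if last_used else []
    candidates += [urljoin(base_url, path) for path in DEFAULT_PATHS]
    candidates += SITEMAP_DIRECTIVE.findall(robots_txt)
    return list(dict.fromkeys(candidates))  # dedupe, keep first occurrence


def sitemap_locs(xml_text: str) -> list[str]:
    """Return every <loc> value, or an empty list if the XML is not parseable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    return [
        el.text.strip()
        for el in root.iter()
        # Strip the XML namespace, if any, before comparing the tag name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text and el.text.strip()
    ]


def find_working_sitemap(
    candidates: list[str], fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the first candidate with parseable entries, else None."""
    for url in candidates:
        body = fetch(url)
        if body and sitemap_locs(body):
            return url
    return None  # no working sitemap: the site would use the fallback crawl
```

A `None` result corresponds to the cases listed above: no file at the standard paths, no `Sitemap:` directive, or a file with nothing parseable.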

Request pattern

Property	Value
Scheduled sitemap sync	One sitemap.xml fetch per site, once every two weeks
Manual resync	On demand via the Resync button; same sitemap diff, so unchanged pages still cost zero requests
Page scrapes	Only when sitemap `<lastmod>` is newer than our last crawl, or the URL is new
Full re-discovery fallback	Every 90 days for sites without a working sitemap
Request rate	Up to 200 requests/min per domain
Concurrent requests	Up to 10 per organization
Request timeout	15 seconds per page
Protocols	HTTPS and HTTP
Content types	HTML only

The crawler is designed to be respectful: rate-limited per domain, concurrency-capped per org, and driven by your sitemap so we never re-fetch content that hasn’t changed.
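
To make the per-domain limit concrete, here is one common way to enforce a cap such as 200 requests/min: a sliding-window limiter keyed by domain. How Halo enforces its limit is not described on this page, so the class name, its parameters, and the choice of a sliding window are all assumptions for illustration. The clock is passed in explicitly so the behavior is easy to verify.

```python
from collections import defaultdict, deque


class DomainRateLimiter:
    """Sliding-window limiter: at most max_requests per window, per domain."""

    def __init__(self, max_requests: int = 200, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._sent: dict[str, deque] = defaultdict(deque)

    def try_acquire(self, domain: str, now: float) -> bool:
        """Record a request at time `now` if the domain is under its limit."""
        sent = self._sent[domain]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_requests:
            return False
        sent.append(now)
        return True
```

With the defaults, the 201st request to one domain inside the same minute is refused, while requests to other domains are unaffected.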

What gets extracted

From each page:

Page text: clean body content with nav, headers, footers, and boilerplate removed
Title and description: from meta tags
Images: URLs, alt text, and surrounding context
Videos: YouTube and Vimeo embeds, with automatic transcript fetching when available

Change detection (defense in depth)

Even when the sitemap signals a change, we hash the normalized content after extraction. If the hash matches what we already have, we skip re-embedding so cosmetic edits (build dates, copyright years, whitespace) don’t burn embedding tokens.
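
The idea can be sketched with a content hash over normalized text. The normalization rules here (collapsing whitespace and dropping copyright-year stamps) are hypothetical examples of cosmetic noise, and SHA-256 is an assumed choice; the page does not specify either.

```python
import hashlib
import re

# Hypothetical noise pattern: copyright stamps such as "© 2024" or "Copyright 2019-2024".
COPYRIGHT_YEAR = re.compile(r"(©|\(c\)|copyright)\s*\d{4}(\s*-\s*\d{4})?", re.IGNORECASE)


def normalize(text: str) -> str:
    """Remove cosmetic noise so equivalent pages produce identical text."""
    text = COPYRIGHT_YEAR.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """Hash the normalized text; equal hashes mean no meaningful change."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def needs_reembedding(new_text: str, stored_hash: str) -> bool:
    """Return True only when the normalized content actually changed."""
    return content_hash(new_text) != stored_hash
```

For example, a page whose only change is the footer year hashes to the same value as before, so `needs_reembedding` returns `False` and no embedding tokens are spent.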

Troubleshooting

Pages are showing as errors

Check your server logs for the HaloAgentsAI-Crawler User-Agent. Common causes:

403 Forbidden: Bot protection is blocking the crawler. See allowlisting above.
429 Too Many Requests: Your server is rate-limiting. Should resolve on retry.
500 Server Error: A server-side issue on the target site.
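
To see which of these applies, you can count response codes for crawler requests in your access log. The sketch below assumes the common "combined" log format used by Nginx and Apache by default; the function name and regular expression are illustrative and will need adjusting for custom log formats.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: "METHOD /path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_status_counts(
    lines: Iterable[str], token: str = "HaloAgentsAI-Crawler"
) -> Counter:
    """Count HTTP status codes for log lines whose User-Agent contains token."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and token in match.group("ua"):
            counts[int(match.group("status"))] += 1
    return counts
```

Pass it an open file, for example `crawler_status_counts(open("access.log"))`, where the path is whatever your server writes to. A result dominated by 403s points at bot protection, while 429s point at rate limiting.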

Not all pages are being discovered

The crawler relies on sitemaps for discovery. Without a working sitemap, it falls back to a full crawl every 90 days that follows links from the homepage, which may miss deeper pages and picks up new pages more slowly. Adding a sitemap is the best way to ensure full coverage.

Content looks wrong or incomplete

The crawler reads raw HTML and doesn’t execute JavaScript. If your site is a single-page app, the crawler may see an empty shell. Use server-side rendering or a static HTML version.
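
A quick way to approximate what a raw-HTML crawler sees is to strip the markup from your page source and look at the text that remains. This sketch uses Python's standard `html.parser`; it is not Halo's extraction pipeline, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text that is present in the HTML without running any scripts."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style/template elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text content of html, ignoring scripts and styles."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Feed it the output of `curl` or "View Source" for the page in question. If the result is empty or only a few words, the content is being rendered client-side and the crawler will likely see the same empty shell.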

Where to go next

URL Sources

Knowledge Overview