
When to use URL sources

URL sources are the fastest way to give your AI hundreds of articles in minutes. If your support docs already live on the public web, such as a help center, docs site, marketing pages, or blog posts, point Halo at the root URL and it ingests everything automatically.

Adding a site

From Knowledge > URL Sources in your dashboard:
  1. Enter a domain or URL (e.g. help.example.com or https://docs.example.com/guides/setup)
  2. Click Add Site
Halo automatically detects whether you’ve entered a site root or a specific page:

Site roots

When you add a domain or root URL, Halo:
  1. Discovers pages by checking sitemap.xml first, then Sitemap directives in robots.txt, then crawling internal links
  2. Creates page entries for every discovered URL
  3. Crawls pages in parallel, extracting content, images, and video transcripts from each one
  4. Ingests the content into your knowledge base with vector embeddings
Recommended for help centers, documentation sites, and knowledge bases.
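
To make the discovery step more concrete, here is a minimal sketch of how Sitemap directives and sitemap entries can be read. It is an illustration only, not Halo's implementation: the function names `sitemaps_from_robots` and `urls_from_sitemap` are hypothetical, the inputs are already-fetched text, and the internal-link fallback is omitted.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace used by the sitemaps.org sitemap format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs listed in 'Sitemap:' directives of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(sitemap_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (loc, lastmod) pairs from a sitemap; lastmod is None if absent."""
    root = ET.fromstring(sitemap_xml)
    entries = []
    for url in root.iter(f"{SITEMAP_NS}url"):
        loc = url.findtext(f"{SITEMAP_NS}loc")
        lastmod = url.findtext(f"{SITEMAP_NS}lastmod")
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries
```

A quick way to use this on your own site is to paste the contents of robots.txt and sitemap.xml into these functions and confirm that the pages you expect Halo to find are actually listed.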

Single pages

When you add a URL with a specific path, Halo crawls only that page. Useful for adding individual blog posts or articles that aren’t part of a larger site you want fully indexed.
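
The distinction between a site root and a single page can be sketched as follows. This is a guess at the rule, not Halo's documented logic: it assumes an empty path or "/" means a site root and anything else means a single page, and the function name `classify_source` is hypothetical.

```python
from urllib.parse import urlparse


def classify_source(raw: str) -> str:
    """Return "site_root" or "single_page" for a user-entered domain or URL."""
    text = raw.strip()
    # Bare domains such as "help.example.com" have no scheme; add one so that
    # urlparse places the host in netloc instead of path.
    if "://" not in text:
        text = "https://" + text
    parsed = urlparse(text)
    if not parsed.netloc:
        raise ValueError(f"not a valid domain or URL: {raw!r}")
    # Assumption: an empty path or "/" is a site root; any other path is a page.
    return "site_root" if parsed.path in ("", "/") else "single_page"
```

Under this assumption, `help.example.com` is classified as a site root and `https://docs.example.com/guides/setup` as a single page.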

Monitoring progress

The URL Sources table shows real-time status for each site:
Column | Description
Status | Discovering (finding pages), Crawling (processing), Complete, or Error
Progress | Pages completed / pages discovered, with error count
Last Crawled | When the site was last fully crawled
Expand a site row to see the status of individual pages.

Recrawling

Halo keeps your knowledge base in sync via a daily sitemap diff, not blind full-site recrawls:
  • Each day we fetch your site’s sitemap.xml.
  • New URLs are picked up and scraped automatically.
  • URLs whose <lastmod> is newer than our last successful crawl are re-scraped, and their embeddings are refreshed.
  • URLs that have been missing from the sitemap for 30 days are removed, and their embeddings are deleted.
  • URLs whose <lastmod> hasn’t changed are left alone. No request, no embedding spend.
For sites without a working sitemap, Halo falls back to a full Firecrawl re-discovery every 90 days so new pages still surface. To trigger an immediate full recrawl at any time, for example after a big launch, click the refresh icon on any site.
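
The daily sitemap diff rules can be sketched as a small planning function. This is an illustration under stated assumptions, not Halo's implementation: `PageState`, `DiffPlan`, and `plan_daily_diff` are hypothetical names, the sitemap is assumed to be already parsed into a URL-to-lastmod mapping, and a URL with no `<lastmod>` is assumed to count as unchanged.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

REMOVAL_GRACE = timedelta(days=30)  # how long a URL may be missing before removal


@dataclass
class PageState:
    last_crawled: datetime                     # last successful crawl of this URL
    missing_since: Optional[datetime] = None   # first day it was absent, if any


@dataclass
class DiffPlan:
    scrape_new: List[str]
    rescrape: List[str]
    remove: List[str]
    skip: List[str]


def plan_daily_diff(
    known: Dict[str, PageState],
    sitemap: Dict[str, Optional[datetime]],
    now: datetime,
) -> DiffPlan:
    """Decide what to do with each URL; updates missing_since on known pages."""
    plan = DiffPlan([], [], [], [])
    for url, lastmod in sitemap.items():
        state = known.get(url)
        if state is None:
            plan.scrape_new.append(url)        # new URL: scrape it
            continue
        state.missing_since = None             # URL is present again
        if lastmod is not None and lastmod > state.last_crawled:
            plan.rescrape.append(url)          # changed since last crawl
        else:
            plan.skip.append(url)              # unchanged (assumption: no lastmod = unchanged)
    for url, state in known.items():
        if url in sitemap:
            continue
        if state.missing_since is None:
            state.missing_since = now          # start the 30-day clock
        if now - state.missing_since >= REMOVAL_GRACE:
            plan.remove.append(url)            # missing for 30 days: delete
    return plan
```

The practical takeaway is that accurate `<lastmod>` values matter: under these rules, a page whose `<lastmod>` never changes is not re-scraped by the daily diff.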

Deleting a site

Click the trash icon to remove a site. Effects:
  • Cancels any in-progress crawl
  • Removes all page entries for that site
  • Deletes the embeddings from your knowledge base
  • Stops future scheduled recrawls

Tips

Sites like help.example.com, docs.example.com, or support.example.com are ideal candidates. They typically have well-structured content with sitemaps, making discovery fast and thorough.
The crawler only processes HTML pages. PDFs, images, and other file types are skipped. To index PDFs, upload them via Files & Internal Docs instead.
If your site uses Cloudflare, Akamai, AWS WAF, Sucuri, or similar, you may need to allowlist the Halo crawler. See Web Crawler.
The crawler reads raw HTML and doesn’t execute JavaScript. If your site is a single-page app that renders content client-side, the crawler may see an empty page. For that content to be indexed, the site must serve it through server-side rendering or a static HTML version.
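
To check whether a page is likely to look empty without JavaScript, you can measure how much visible text its raw HTML contains. The sketch below is a rough heuristic, not how Halo evaluates pages: the function names and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a reader of the raw HTML would see, without running JS."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(raw_html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests content is rendered by JS."""
    return len(visible_text(raw_html)) < min_chars
```

Run this against the response body of a plain HTTP request (for example, the output of `curl`), not against HTML saved from a browser, since the browser copy already includes JavaScript-rendered content.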

Where to go next

Web Crawler

User-Agent, allowlisting, and detailed crawler behavior.

Files & Internal Docs

Add content that isn’t available at a public URL.