When to use URL sources
URL sources are the fastest way to give your AI hundreds of articles in minutes. If your support docs already live on the public web (a help center, docs site, marketing pages, or blog posts), point Halo at the root URL and it ingests everything automatically.
Adding a site
From Knowledge > URL Sources in your dashboard:
- Enter a domain or URL (e.g. help.example.com or https://docs.example.com/guides/setup)
- Click Add Site
Site roots
When you add a domain or root URL, Halo:
- Discovers pages by checking sitemap.xml, robots.txt sitemap directives, then crawling internal links
- Creates page entries for every discovered URL
- Crawls each page in parallel, extracting content, images, and video transcripts
- Ingests the content into your knowledge base with vector embeddings
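The discovery step above can be sketched roughly as follows. This is a minimal illustration, not Halo's actual crawler: the function names `sitemaps_from_robots` and `urls_from_sitemap` are hypothetical, fetching is left out, and the fallback of crawling internal links is not shown. It assumes sitemaps follow the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap and sitemap index files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named in 'Sitemap:' directives of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" is preserved.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
```

A discovery pass would fetch `/robots.txt`, feed its body to `sitemaps_from_robots`, fall back to `/sitemap.xml` if no directive is present, and then call `urls_from_sitemap` on each sitemap it retrieves.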
Single pages
When you add a URL with a specific path, Halo crawls only that page. This is useful for adding individual blog posts or articles that aren’t part of a larger site you want fully indexed.
Monitoring progress
The URL Sources table shows real-time status for each site:
| Column | Description |
|---|---|
| Status | Discovering (finding pages), Crawling (processing), Complete, or Error |
| Progress | Pages completed / pages discovered, with error count |
| Last Crawled | When the site was last fully crawled |
Recrawling
Halo keeps your knowledge base in sync via a daily sitemap diff, not blind full-site recrawls:
- Each day we fetch your site’s sitemap.xml.
- New URLs are picked up and scraped automatically.
- URLs whose <lastmod> is newer than our last successful crawl are re-scraped, and their embeddings are refreshed.
- URLs that disappear from the sitemap for 30 days are removed (their embeddings deleted).
- URLs whose <lastmod> hasn’t changed are left alone. No request, no embedding spend.
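The daily diff described above can be pictured with a small sketch. The names `PageState`, `plan_recrawl`, and `REMOVAL_GRACE` are hypothetical and not part of Halo; the sketch also assumes every sitemap entry carries a `<lastmod>` value and that the first date a URL went missing is tracked per page, neither of which the text above specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# From the text above: URLs missing from the sitemap for 30 days are removed.
REMOVAL_GRACE = timedelta(days=30)


@dataclass
class PageState:
    last_crawled: datetime                    # last successful crawl of this URL
    missing_since: Optional[datetime] = None  # when the URL first left the sitemap


def plan_recrawl(
    sitemap: dict[str, datetime],   # URL -> <lastmod> from today's sitemap
    known: dict[str, PageState],    # URL -> state from earlier crawls
    now: datetime,
) -> dict[str, list[str]]:
    """Sort URLs into scrape / rescrape / skip / remove buckets."""
    plan: dict[str, list[str]] = {"scrape": [], "rescrape": [], "skip": [], "remove": []}
    for url, lastmod in sitemap.items():
        state = known.get(url)
        if state is None:
            plan["scrape"].append(url)        # new URL
        elif lastmod > state.last_crawled:
            plan["rescrape"].append(url)      # changed since the last crawl
        else:
            plan["skip"].append(url)          # unchanged: no request made
    for url, state in known.items():
        if url in sitemap:
            continue
        # A URL first seen missing today starts its grace period now.
        missing_since = state.missing_since or now
        if now - missing_since >= REMOVAL_GRACE:
            plan["remove"].append(url)
    return plan
```

Only the `scrape` and `rescrape` buckets would trigger requests and new embeddings, which is the saving the sitemap diff is meant to provide.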
Deleting a site
Click the trash icon to remove a site. Effects:
- Cancels any in-progress crawl
- Removes all page entries for that site
- Deletes the embeddings from your knowledge base
- Stops future scheduled recrawls
Tips
Help centers work best
Sites like help.example.com, docs.example.com, or support.example.com are ideal candidates. They typically have well-structured content with sitemaps, making discovery fast and thorough.
Non-HTML content is skipped
The crawler only processes HTML pages. PDFs, images, and other file types are skipped. To index PDFs, upload them via Files & Internal Docs instead.
Bot protection may block the crawler
If your site uses Cloudflare, Akamai, AWS WAF, Sucuri, or similar, you may need to allowlist the Halo crawler. See Web Crawler.
JavaScript-rendered content won't be picked up
The crawler reads raw HTML and doesn’t execute JavaScript. If your site is a single-page app that renders content client-side, the crawler may see an empty page. Server-side rendering or a static HTML version is required.
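One way to check whether a page is affected is to look at how much readable text its raw HTML contains before any JavaScript runs. The sketch below is only a rough heuristic using Python's standard `html.parser`. The helper names and the 200-character threshold are assumptions chosen for illustration, not values used by the Halo crawler.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-JavaScript reader would find in the markup."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(raw_html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little text in raw HTML suggests client-side rendering.

    min_chars is an arbitrary threshold picked for this sketch.
    """
    return len(visible_text(raw_html)) < min_chars
```

Running `looks_client_rendered` on the response body of a plain HTTP GET (with no browser involved) gives a quick indication of whether a page needs server-side rendering before it can be indexed.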
Where to go next
Web Crawler
User-Agent, allowlisting, and detailed crawler behavior.
Files & Internal Docs
Add content that isn’t available at a public URL.