URL Sources

When to use URL sources

URL sources are the fastest way to give your AI hundreds of articles in minutes. If your support docs already live on the public web — a help center, docs site, marketing pages, blog posts — point Halo at the root URL and it ingests everything automatically.

Adding a site

From Knowledge > URL Sources in your dashboard:

1. Enter a domain or URL (e.g. help.example.com or https://docs.example.com/guides/setup).
2. Click Add Site.

Halo automatically detects whether you’ve entered a site root or a specific page:

Site roots

When you add a domain or root URL, Halo:

1. Discovers pages by checking sitemap.xml, then the sitemap directives in robots.txt, then by crawling internal links
2. Creates a page entry for every discovered URL
3. Crawls pages in parallel, extracting content, images, and video transcripts
4. Ingests the content into your knowledge base with vector embeddings

Recommended for help centers, documentation sites, and knowledge bases.
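
The first two discovery sources can be sketched in a few lines of Python. This is only an illustration of how sitemap directives and sitemap entries are typically read, not Halo's actual implementation; the function names `sitemaps_from_robots` and `pages_from_sitemap` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named by `Sitemap:` directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        match = re.match(r"\s*sitemap\s*:\s*(\S+)", line, flags=re.IGNORECASE)
        if match:
            urls.append(match.group(1))
    return urls


def pages_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value in a sitemap document.

    For a sitemap index file, the values are child sitemap URLs
    rather than page URLs.
    """
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


robots = "User-agent: *\nSitemap: https://help.example.com/sitemap.xml\n"
sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://help.example.com/getting-started</loc></url>"
    "<url><loc>https://help.example.com/billing</loc></url>"
    "</urlset>"
)
print(sitemaps_from_robots(robots))
print(pages_from_sitemap(sitemap))
```

Running the same two functions against your own robots.txt and sitemap.xml is a quick way to preview which pages a sitemap-based discovery is likely to find.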

Single pages

When you add a URL with a specific path, Halo crawls only that page. This is useful for adding individual blog posts or articles that aren’t part of a larger site you want fully indexed.

One exception: URLs on help-center-style subdomains (help., docs., support., kb., and similar) are always treated as sites, even when the URL has a deep path. Pasting the link to a single help article therefore ingests the whole help center. Halo keeps one site entry per domain, so adding multiple deep links from the same help center never crawls the domain twice.
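
The site-versus-page rule can be pictured with a small sketch. It is an assumption-laden illustration rather than Halo's real logic: `classify`, `site_entries`, and `HELP_CENTER_PREFIXES` are hypothetical names, and the prefix list contains only the four subdomains named above, not the undocumented "similar" ones.

```python
from urllib.parse import urlparse

# Only the prefixes named in this article; the full list is not documented here.
HELP_CENTER_PREFIXES = ("help.", "docs.", "support.", "kb.")


def classify(raw: str) -> tuple[str, str]:
    """Return (kind, host), where kind is "site" or "page"."""
    # Accept bare domains such as "help.example.com" by assuming https.
    parsed = urlparse(raw if "://" in raw else f"https://{raw}")
    host = (parsed.hostname or "").lower()
    has_path = parsed.path.strip("/") != ""
    # Help-center-style subdomains count as sites even with a deep path.
    if host.startswith(HELP_CENTER_PREFIXES) or not has_path:
        return "site", host
    return "page", host


def site_entries(urls: list[str]) -> list[str]:
    """Collapse site URLs to one entry per domain, preserving first-seen order."""
    seen: dict[str, str] = {}
    for url in urls:
        kind, host = classify(url)
        if kind == "site":
            seen.setdefault(host, url)
    return list(seen)


print(classify("https://www.example.com/blog/launch-notes"))
print(classify("https://help.example.com/articles/reset-password"))
print(site_entries([
    "https://help.example.com/articles/reset-password",
    "help.example.com/articles/billing",
]))
```

Under these assumptions, the blog post is classified as a single page, while both help-center deep links collapse into one site entry for help.example.com.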

Monitoring progress

The URL Sources table shows real-time status for each site:

| Column | Description |
| --- | --- |
| Status | `Discovering` (finding pages), `Crawling` (processing pages), `Complete`, or `Error` |
| Progress | Pages completed / pages discovered, with error count |
| Last Crawled | When the site was last fully crawled |

Expand a site row to see the status of individual pages.

Recrawling

Halo keeps your knowledge base in sync via a daily sitemap diff, not blind full-site recrawls:

Each day, Halo fetches your site’s sitemap.xml.
New URLs are picked up and scraped automatically.
URLs whose <lastmod> is newer than Halo’s last successful crawl are re-scraped, and their embeddings are refreshed.
URLs that have been absent from the sitemap for 30 days are removed, and their embeddings are deleted.
URLs whose <lastmod> hasn’t changed are left alone: no request, no embedding spend.
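
The daily diff described above amounts to a four-way classification of URLs. The sketch below shows one way to express it. It is not Halo's code: `PageState`, `plan_sync`, and the `missing_since` field (the day a URL was first seen missing) are hypothetical, introduced only to make the 30-day rule concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REMOVAL_GRACE = timedelta(days=30)  # the 30-day window from the rules above


@dataclass
class PageState:
    last_crawled: datetime                    # last successful crawl of this URL
    missing_since: Optional[datetime] = None  # assumed bookkeeping field


def plan_sync(sitemap: dict[str, datetime],
              known: dict[str, PageState],
              now: datetime) -> dict[str, list[str]]:
    """Classify URLs given today's sitemap (URL -> <lastmod>) and stored state."""
    plan: dict[str, list[str]] = {"scrape": [], "rescrape": [], "remove": [], "skip": []}
    for url, lastmod in sitemap.items():
        state = known.get(url)
        if state is None:
            plan["scrape"].append(url)        # new URL
        elif lastmod > state.last_crawled:
            plan["rescrape"].append(url)      # changed since the last crawl
        else:
            plan["skip"].append(url)          # unchanged: no request made
    for url, state in known.items():
        if url in sitemap:
            continue
        # Remove only after the URL has been absent for the full grace period.
        if state.missing_since is not None and now - state.missing_since >= REMOVAL_GRACE:
            plan["remove"].append(url)
    return plan


now = datetime(2024, 6, 1)
plan = plan_sync(
    sitemap={"/new": datetime(2024, 5, 30), "/edited": datetime(2024, 5, 20),
             "/same": datetime(2024, 1, 1)},
    known={"/edited": PageState(datetime(2024, 5, 1)),
           "/same": PageState(datetime(2024, 5, 1)),
           "/gone": PageState(datetime(2024, 3, 1), missing_since=datetime(2024, 4, 15))},
    now=now,
)
print(plan)
```

The practical takeaway is that accurate <lastmod> values in your sitemap are what let an edited article be picked up on the next daily pass.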

For sites without a working sitemap, Halo falls back to a full Firecrawl re-discovery every 90 days so that new pages still surface. To trigger an immediate full recrawl at any time, for example after a big launch, click the refresh icon on the site.

Deleting a site

Click the trash icon to remove a site. Deleting a site:

Cancels any in-progress crawl
Removes all page entries for that site
Deletes the embeddings from your knowledge base
Stops future scheduled recrawls

Tips

Help centers work best

Sites like help.example.com, docs.example.com, or support.example.com are ideal candidates. They typically have well-structured content with sitemaps, making discovery fast and thorough.

Non-HTML content is skipped

The crawler only processes HTML pages. PDFs, images, and other file types are skipped. To index PDFs, upload them via Files & Internal Docs instead.

Bot protection may block the crawler

If your site uses Cloudflare, Akamai, AWS WAF, Sucuri, or similar, you may need to allowlist the Halo crawler. See Web Crawler.

JavaScript-rendered content won’t be picked up

The crawler reads raw HTML and doesn’t execute JavaScript. If your site is a single-page app that renders content client-side, the crawler may see an empty page. For such pages to be indexed, they need server-side rendering or a static HTML version.
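
To check how a page looks without JavaScript, you can extract the text from its raw HTML yourself. The sketch below uses Python's standard html.parser module. It approximates what any non-JavaScript reader sees and is not Halo's extraction logic; `visible_text` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring script, style, and title contents."""

    SKIPPED = {"script", "style", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a reader that does not run JavaScript could extract."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


spa_shell = (
    "<html><head><title>App</title></head>"
    '<body><div id="root"></div><script>render()</script></body></html>'
)
server_rendered = "<html><body><h1>Reset your password</h1><p>Open Settings.</p></body></html>"

print(repr(visible_text(spa_shell)))
print(repr(visible_text(server_rendered)))
```

If the text extracted from a page's raw response is empty or nearly empty, as with the app shell here, that page probably depends on client-side rendering.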

Where to go next

Web Crawler

Files & Internal Docs