Why Use Data Providers?
Data providers eliminate manual knowledge management by automating the process of connecting your AI to existing knowledge sources:
- Automatic updates - Keep your AI’s knowledge fresh by scheduling regular syncs of documentation sites, wikis, and databases
- Reduced maintenance - No need to manually copy content into your AI - connect once and let it stay synchronized
- Version tracking - Review what changed between syncs with snapshot comparisons
- Audience targeting - Automatically scope knowledge to specific user segments
- Scalable knowledge - Process thousands of pages from multiple sources efficiently
How Data Providers Work
Provider Lifecycle
- Create - Configure a data provider with source details (URLs, credentials, filters)
- Sync - The provider fetches content and creates a snapshot of all discovered sources
- Process - Content is extracted, cleaned, and made available to your AI
- Schedule - Optionally set automatic syncs to keep knowledge current
- Monitor - Track sync status, view changes, and manage sources
Snapshots
Each sync creates a snapshot - a point-in-time record of all knowledge sources discovered by the provider. Snapshots let you:
- Track changes - Compare snapshots to see what content was added, modified, or removed
- Review before deploying - Examine new sources before making them available to your AI
- Troubleshoot issues - Investigate when specific content was added or changed
Each snapshot has a status:
- PENDING - Sync is in progress; content is being fetched
- COMPLETED - Sync finished successfully, sources are available
- FAILED - Sync encountered errors and needs attention
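A client that triggers a sync typically polls until the snapshot leaves PENDING. A minimal sketch in TypeScript, assuming a hypothetical REST endpoint and response shape (the actual botBrains API paths may differ):

```typescript
// Poll a snapshot until it reaches a terminal status.
// NOTE: the endpoint path and response shape are assumptions for illustration.
type SnapshotStatus = "PENDING" | "COMPLETED" | "FAILED";

async function waitForSnapshot(
  apiBase: string,
  apiKey: string,
  snapshotId: string,
  intervalMs = 10_000,
): Promise<SnapshotStatus> {
  while (true) {
    const res = await fetch(`${apiBase}/snapshots/${snapshotId}`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`Snapshot lookup failed: ${res.status}`);
    const { status } = (await res.json()) as { status: SnapshotStatus };
    if (status !== "PENDING") return status; // COMPLETED or FAILED
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```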
Provider Types
Web Crawler
Automatically crawl and extract content from websites, documentation sites, and help centers.
Best for: Public documentation, help centers, blog articles, product pages
How it works: Starting from seed URLs, the crawler discovers and processes pages according to your scope and filter rules.
Configuration Options
Seed URLs
The starting points for your crawl. These should be HTTPS URLs pointing to your main documentation or help center pages.
Crawl Scope
Controls which discovered URLs the crawler is allowed to visit (see the sketch after this list):
- Same Domain - Crawls all subdomains under the root domain
  - Seed: https://docs.example.com
  - Allows: https://docs.example.com/api, https://blog.example.com
  - Blocks: https://other-site.com
- Same Hostname - Only crawls pages on the exact subdomain
  - Seed: https://docs.example.com
  - Allows: https://docs.example.com/api
  - Blocks: https://blog.example.com, https://docs.example.com:8080/api
- Same Origin - Strictest setting; matches hostname, protocol, and port exactly
  - Seed: https://docs.example.com
  - Allows: https://docs.example.com/api
  - Blocks: http://docs.example.com/api, https://docs.example.com:8080/api
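The scope rules map naturally onto URL comparisons. A minimal TypeScript sketch of how the three scopes could be evaluated - an illustration of the rules above, not botBrains’ actual implementation:

```typescript
// Decide whether a discovered URL falls inside the crawl scope of a seed URL.
type CrawlScope = "same-domain" | "same-hostname" | "same-origin";

// Naive registrable-domain extraction: keeps the last two hostname labels.
// Production crawlers use the Public Suffix List to handle suffixes like .co.uk.
function rootDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

function inScope(seed: string, candidate: string, scope: CrawlScope): boolean {
  const s = new URL(seed);
  const c = new URL(candidate);
  switch (scope) {
    case "same-domain":   // any subdomain of the seed's root domain
      return rootDomain(c.hostname) === rootDomain(s.hostname);
    case "same-hostname": // hostname and port must match; protocol may differ
      return c.host === s.host;
    case "same-origin":   // protocol, hostname, and port must all match
      return c.origin === s.origin;
  }
}

// Matches the examples above:
// inScope("https://docs.example.com", "https://blog.example.com", "same-domain")             -> true
// inScope("https://docs.example.com", "https://docs.example.com:8080/api", "same-hostname")  -> false
// inScope("https://docs.example.com", "http://docs.example.com/api", "same-origin")          -> false
```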
Render Mode
- Automatic - Intelligently decides whether JavaScript rendering is needed
- JavaScript - Fully renders pages with JavaScript (for SPAs and dynamic content)
- No JavaScript - Fetches raw HTML only (faster for static sites)
- URL Limit - Maximum pages to crawl (1-20,000)
- Concurrency Limit - Simultaneous requests (1-50) - use lower values to avoid overloading websites
- Query Aware - Treat URLs with different query strings as unique pages
  - When enabled: page?view=list and page?view=grid are different pages
  - When disabled: Both are treated as the same page
- Fragment Aware - Crawl URL fragments (anchors after #) as separate pages
  - When enabled: page#section1 and page#section2 are different pages
  - When disabled: Both are treated as the same page
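Conceptually, these two toggles control how URLs are normalized for deduplication. A minimal sketch, assuming the crawler keys its visited-set on a normalized URL:

```typescript
// Normalize a URL into its deduplication key.
// With queryAware/fragmentAware disabled, those URL parts are stripped,
// so variants collapse into one page.
function dedupeKey(raw: string, queryAware: boolean, fragmentAware: boolean): string {
  const url = new URL(raw);
  if (!queryAware) url.search = "";
  if (!fragmentAware) url.hash = "";
  return url.toString();
}

// dedupeKey("https://x.com/page?view=list", false, false)
//   === dedupeKey("https://x.com/page?view=grid", false, false)  -> same page
```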
Content Selectors
Selectors let you include or exclude specific page elements during extraction. When include-only selectors are specified, only matching elements are kept; nested matches are deduplicated by keeping the outermost element.
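The outermost-element rule can be pictured as follows. A sketch in TypeScript against the browser DOM API (illustrative only; the real extraction pipeline need not use the DOM):

```typescript
// Given include-only selectors, keep only the outermost matching elements:
// any match contained inside another match is dropped as a duplicate.
function outermostMatches(root: Document, selectors: string[]): Element[] {
  const matches = Array.from(root.querySelectorAll(selectors.join(",")));
  return matches.filter(
    (el) => !matches.some((other) => other !== el && other.contains(el)),
  );
}

// With selectors ["main", "article"], an <article> nested inside <main>
// is discarded and only the surrounding <main> is kept.
```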
Web Crawler Example
Crawling a documentation site with specific requirements:
- Navigate to Data Providers > Add Data Provider
- Select Web Crawler as the provider type
- Configure - Set seed URLs, crawl scope, and filters (an example configuration follows these steps)
- Save and Sync - The crawler will begin discovering and processing pages
- Monitor - View the snapshot timeline to track progress
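As an illustration, a configuration for crawling only the help-center section of a docs site might look like this (field names are representative, not the exact UI labels):

```typescript
// Hypothetical crawler configuration, mirroring the options described above.
const crawlerConfig = {
  name: "Product Docs - Help Center",
  seedUrls: ["https://docs.example.com/help"],
  crawlScope: "same-hostname",   // stay on docs.example.com
  renderMode: "automatic",       // let the crawler decide if JS rendering is needed
  urlLimit: 500,                 // start small to validate patterns, then raise
  concurrencyLimit: 5,           // be gentle with the target site
  queryAware: false,             // query params don't change content here
  fragmentAware: false,          // #anchors are not separate pages
  includeSelectors: ["main"],    // keep only the main content area
  excludeSelectors: ["nav", "footer"], // strip navigation and footers
};
```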
Collection
Manually create and manage knowledge snippets for content that doesn’t exist in crawlable sources.
Best for: Internal procedures, FAQs, policy documents, quick knowledge additions
How it works: Create a collection as a container, then add individual snippets of HTML content through the UI or API.
Creating a Collection
- Navigate to Data Providers > Add Data Provider
- Select Collection as the provider type
- Configure - Give the collection a descriptive name
- Save - Your collection is ready for content
Adding Snippets
After creating a collection, add knowledge snippets:
- Navigate to the collection detail page
- Click Add Snippet
- Provide:
- Name - Descriptive title for the snippet
- Content - HTML content (rich text supported)
- Audience - Optional audience to scope this specific snippet
- Save - The snippet is immediately available to your AI
Collections don’t use snapshots or scheduling since content is added manually. Each snippet becomes a source immediately upon creation.
Collection Example
Creating a collection for internal HR policies might look like this via the API (endpoint paths and field names are illustrative; consult the API reference for the exact contract):
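```typescript
// Hypothetical API calls: create a collection, then add a snippet to it.
const apiBase = "https://api.botbrains.io"; // assumed base URL
const headers = {
  Authorization: `Bearer ${process.env.BOTBRAINS_API_KEY}`,
  "Content-Type": "application/json",
};

// 1. Create the collection (the container).
const collection = await fetch(`${apiBase}/data-providers`, {
  method: "POST",
  headers,
  body: JSON.stringify({ type: "collection", name: "HR Policies" }),
}).then((res) => res.json());

// 2. Add a snippet; it becomes a source immediately.
await fetch(`${apiBase}/data-providers/${collection.id}/snippets`, {
  method: "POST",
  headers,
  body: JSON.stringify({
    name: "Parental Leave Policy",
    content: "<h1>Parental Leave</h1><p>Employees are entitled to…</p>",
    audience: "employees", // optional: scope this snippet
  }),
});
```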
Confluence
Connect to Atlassian Confluence spaces to automatically sync documentation and wiki content.
Best for: Confluence Cloud workspaces, internal documentation wikis, team knowledge bases
Status: Configuration interface coming soon
The Confluence provider is currently in development. Contact support@botbrains.io if you need Confluence integration for your project.
Scheduling Automatic Syncs
Keep your AI’s knowledge fresh by scheduling regular syncs for web crawlers and other automated providers.
Setting a Schedule
When creating or editing a provider:
- Locate the Schedule section
- Select days to run syncs (Monday through Sunday)
- Save - A random time between 1 and 5 AM will be assigned automatically
Schedule Examples
Daily Documentation Updates
For fast-moving documentation, select all seven days so the provider re-syncs every night. As a sketch (field names illustrative):
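```typescript
// Hypothetical schedule settings for a provider that syncs nightly.
const schedule = {
  days: ["mon", "tue", "wed", "thu", "fri", "sat", "sun"],
  // The platform assigns a random run time between 1 and 5 AM automatically.
};

// A weekly cadence for stable content would instead be: { days: ["sun"] }
```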
Disabling Schedules
To stop automatic syncs:
- Edit the data provider
- Deselect all days in the schedule section
- Save - No automatic syncs will occur
Managing Snapshots
Viewing Snapshots
Each provider’s detail page shows a timeline of snapshots:
- Navigate to the data provider
- View the snapshot timeline showing sync history
- Click a snapshot to expand and view details
Comparing Snapshots
See what changed between syncs:
- Select a snapshot from the timeline
- View the automatic comparison with the previous snapshot
- Review:
  - Added - New sources discovered
  - Modified - Existing sources with content changes
  - Removed - Sources no longer found
The first snapshot for a provider won’t show a comparison since there’s no previous snapshot to compare against.
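Under the hood, a comparison like this reduces to a set diff over source URLs plus a content check. A conceptual sketch, not the actual implementation:

```typescript
// Compare two snapshots, modeled as maps from source URL to a content hash.
type Snapshot = Map<string, string>;

function diffSnapshots(prev: Snapshot, next: Snapshot) {
  const added: string[] = [];
  const modified: string[] = [];
  const removed: string[] = [];

  for (const [url, hash] of next) {
    if (!prev.has(url)) added.push(url);                  // new source discovered
    else if (prev.get(url) !== hash) modified.push(url);  // content changed
  }
  for (const url of prev.keys()) {
    if (!next.has(url)) removed.push(url);                // source no longer found
  }
  return { added, modified, removed };
}
```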
Triggering Manual Syncs
Force a sync outside the scheduled times:
- Navigate to the data provider detail page
- Click Sync Now
- Monitor the new snapshot’s progress in the timeline
Assigning Audiences to Providers
Control which users can access knowledge from a provider by assigning default audiences.
How It Works
When you assign an audience to a data provider:
- All new sources discovered by that provider automatically inherit the audience assignment
- Only users matching the audience criteria can see those sources in AI responses
- Existing sources from previous syncs are not affected - only new sources from future syncs
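The inheritance rule is easy to state in code. A conceptual sketch of how new sources pick up the provider’s default audience while existing ones keep theirs (the “no audience means visible to everyone” default is an assumption for illustration):

```typescript
// Conceptual model of audience inheritance at sync time.
interface Source { url: string; audience?: string; }
interface Provider { defaultAudience?: string; }

// Only sources created by this sync inherit the provider's default audience;
// sources from earlier syncs keep whatever audience they already have.
function tagNewSources(provider: Provider, newSources: Source[]): Source[] {
  return newSources.map((source) => ({
    ...source,
    audience: source.audience ?? provider.defaultAudience,
  }));
}

// At answer time, a source is visible only if the user matches its audience.
// (Assumed: sources with no audience are visible to everyone.)
function visibleTo(userAudiences: Set<string>, source: Source): boolean {
  return source.audience === undefined || userAudiences.has(source.audience);
}
```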
Setting a Default Audience
- Navigate to Data Providers > Edit Provider
- Select an audience from the Audience dropdown
- Save - Future syncs will assign this audience to new sources
Use Cases
Customer Documentation
For example, assign a “Customers” audience to your help-center crawler so that sources from future syncs only appear in responses for users in that segment.
Best Practices
Start with Clear Goals
Define what knowledge your AI needs before configuring providers. Focus on high-value content that answers common customer questions.
Test Crawl Configuration
Use small URL limits initially to verify your include/exclude patterns and selectors capture the right content before running full syncs.
Schedule Strategically
Consider how often your source content changes. Documentation sites that change daily need frequent syncs, while stable FAQs might only need weekly updates.
Monitor Snapshot Changes
Regularly review snapshot diffs to catch unintended changes, removed pages, or new content that needs attention.
Use Descriptive Names
Name providers clearly to indicate their purpose and content (e.g., “Product Docs - API Reference” vs. “Provider 1”).
Leverage Audiences
Assign appropriate audiences to providers to ensure users only see relevant knowledge for their segment or access level.
Troubleshooting
Snapshot Stuck in PENDING
Symptoms: Snapshot shows PENDING status for extended periods
Solutions:
- Check that seed URLs are accessible (not behind authentication)
- Verify URL patterns aren’t too broad, causing excessive crawling
- Reduce concurrency limit if the target site is rate-limiting requests
- Contact support if the issue persists after 30 minutes
Missing Pages in Snapshot
Symptoms: Expected pages aren’t appearing in snapshot sources
Solutions:
- Verify pages are linked from seed URLs (the crawler follows links)
- Check include/exclude patterns aren’t blocking desired pages
- Ensure crawl scope allows the page’s domain/hostname
- Review URL limit - you may need to increase it
- For JavaScript-heavy sites, try “JavaScript” render mode
Too Many Irrelevant Pages
Symptoms: Snapshot includes unwanted pages or sections
Solutions:
- Add exclude patterns for unwanted URL paths
- Use exclude selectors to remove navigation, footers, and ads
- Tighten crawl scope to “Same Hostname” or “Same Origin”
- Reduce URL limit to focus on most important pages
- Consider using include patterns to allowlist specific paths
Duplicate Content
Symptoms: Same content appearing multiple times from different URLs
Solutions:
- Disable “Query Aware” if query parameters don’t change content
- Disable “Fragment Aware” if fragments don’t represent unique pages
- Add exclude patterns for known duplicate URL patterns
- Use include-only selectors to extract just the main content area
API Integration
Data providers can be managed programmatically via the botBrains API.
Create a Provider
A sketch of creating a web crawler via the API (endpoint and payload shape are illustrative; consult the API reference for the exact contract):
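```typescript
// Hypothetical request to create a web crawler data provider.
const response = await fetch("https://api.botbrains.io/data-providers", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.BOTBRAINS_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    type: "web-crawler",
    name: "Product Docs - API Reference",
    seedUrls: ["https://docs.example.com"],
    crawlScope: "same-hostname",
    urlLimit: 1000,
    schedule: { days: ["mon", "wed", "fri"] },
  }),
});
const provider = await response.json();
console.log(`Created provider ${provider.id}`);
```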
Next Steps
After setting up data providers:
- Create Snippets - Add supplementary knowledge for specific topics
- Configure Tables - Structure data for precise lookups and calculations
- Instruct Your AI - Define how your AI uses this knowledge
- Review Conversations - Monitor how your AI answers questions with this knowledge