Data providers are the foundation of your AI’s knowledge - they automatically gather, update, and organize information from various sources to ensure your AI has accurate, up-to-date answers for your customers.

Why Use Data Providers?

Data providers eliminate manual knowledge management by automating the process of connecting your AI to existing knowledge sources:
  • Automatic updates - Keep your AI’s knowledge fresh by scheduling regular syncs of documentation sites, wikis, and databases
  • Reduced maintenance - No need to manually copy content into your AI - connect once and let it stay synchronized
  • Version tracking - Review what changed between syncs with snapshot comparisons
  • Audience targeting - Automatically scope knowledge to specific user segments
  • Scalable knowledge - Process thousands of pages from multiple sources efficiently

How Data Providers Work

Provider Lifecycle

  1. Create - Configure a data provider with source details (URLs, credentials, filters)
  2. Sync - The provider fetches content and creates a snapshot of all discovered sources
  3. Process - Content is extracted, cleaned, and made available to your AI
  4. Schedule - Optionally set automatic syncs to keep knowledge current
  5. Monitor - Track sync status, view changes, and manage sources

Snapshots

Each sync creates a snapshot - a point-in-time record of all knowledge sources discovered by the provider. Snapshots let you:
  • Track changes - Compare snapshots to see what content was added, modified, or removed
  • Review before deploying - Examine new sources before making them available to your AI
  • Troubleshoot issues - Investigate when specific content was added or changed
Snapshots progress through three states:
  • PENDING - Sync is in progress, content is being fetched
  • COMPLETED - Sync finished successfully, sources are available
  • FAILED - Sync encountered errors and needs attention
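If you monitor syncs programmatically, you can poll the snapshot list until the state resolves. Below is a minimal sketch in Python using the List Snapshots endpoint shown under API Integration at the end of this page; the host, auth scheme, and response shape are assumptions, not the definitive client:

import time
import requests

BASE = "https://api.botbrains.io"  # hypothetical host - see the API Reference
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
URL = BASE + "/api/v1/projects/your-project-id/data-providers/your-provider-id/snapshots"

def wait_for_snapshot(snapshot_id: str, poll_seconds: int = 30) -> str:
    # Poll the List Snapshots endpoint until the snapshot leaves PENDING
    # (the response shape used here is an assumption)
    while True:
        snapshots = requests.get(URL, headers=HEADERS).json()
        status = next(s["status"] for s in snapshots if s["id"] == snapshot_id)
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)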

Provider Types

Web Crawler

Automatically crawl and extract content from websites, documentation sites, and help centers.

Best for: Public documentation, help centers, blog articles, product pages

How it works: Starting from seed URLs, the crawler discovers and processes pages according to your scope and filter rules.

Configuration Options

Seed URLs
The starting points for your crawl. These should be HTTPS URLs pointing to your main documentation or help center pages.
https://docs.yourcompany.com
https://help.yourcompany.com/getting-started
Crawl Scope
Controls which pages the crawler can visit based on their relationship to the seed URLs:
  • Same Domain - Crawls all subdomains under the root domain
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api, https://blog.example.com
    • Blocks: https://other-site.com
  • Same Hostname - Only crawls pages on the exact subdomain
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api
    • Blocks: https://blog.example.com, https://docs.example.com:8080/api
  • Same Origin - Strictest setting, matches hostname, protocol, and port exactly
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api
    • Blocks: http://docs.example.com/api, https://docs.example.com:8080/api
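These scope rules can be expressed as simple URL comparisons. Here is a sketch of the matching logic in Python, mirroring the examples above - an illustration of the rules, not the crawler's actual implementation (in particular, deriving the root domain properly would require the public-suffix list):

from urllib.parse import urlsplit

def root_domain(hostname: str) -> str:
    # Simplistic: keep the last two labels ("docs.example.com" -> "example.com").
    # A production crawler would consult the public-suffix list instead.
    return ".".join(hostname.split(".")[-2:])

def in_scope(seed: str, candidate: str, scope: str) -> bool:
    s, c = urlsplit(seed), urlsplit(candidate)
    if scope == "same_origin":
        # Protocol, hostname, and port must all match exactly
        return (s.scheme, s.netloc) == (c.scheme, c.netloc)
    if scope == "same_hostname":
        # Exact host (including port) must match
        return s.netloc == c.netloc
    if scope == "same_domain":
        # Any subdomain under the seed's root domain is allowed
        return root_domain(c.hostname or "") == root_domain(s.hostname or "")
    raise ValueError(f"unknown scope: {scope}")

# Mirrors the examples above:
assert in_scope("https://docs.example.com", "https://blog.example.com", "same_domain")
assert not in_scope("https://docs.example.com", "https://docs.example.com:8080/api", "same_hostname")
assert not in_scope("https://docs.example.com", "http://docs.example.com/api", "same_origin")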
Render Mode
Determines how pages are processed:
  • Automatic - Intelligently decides whether JavaScript rendering is needed
  • JavaScript - Fully renders pages with JavaScript (for SPAs and dynamic content)
  • No JavaScript - Fetches raw HTML only (faster for static sites)
URL Limits
Control crawl size and concurrency:
  • URL Limit - Maximum pages to crawl (1-20,000)
  • Concurrency Limit - Simultaneous requests (1-50) - use lower values to avoid overloading websites
URL Variants
Control how the crawler treats URL variations:
  • Query Aware - Treat URLs with different query strings as unique pages
    • When enabled: page?view=list and page?view=grid are different
    • When disabled: Both are treated as the same page
  • Fragment Aware - Crawl URL fragments (anchors after #) as separate pages
    • When enabled: page#section1 and page#section2 are different
    • When disabled: Both are treated as the same page
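Both options amount to URL normalization before deduplication: variants that normalize to the same URL count as one page. A short illustrative sketch:

from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str, query_aware: bool, fragment_aware: bool) -> str:
    p = urlsplit(url)
    query = p.query if query_aware else ""           # keep query only when Query Aware
    fragment = p.fragment if fragment_aware else ""  # keep fragment only when Fragment Aware
    return urlunsplit((p.scheme, p.netloc, p.path, query, fragment))

urls = ["https://docs.example.com/page?view=list",
        "https://docs.example.com/page?view=grid",
        "https://docs.example.com/page#section1"]
# With both options disabled, all three collapse to one page:
print({canonical_url(u, query_aware=False, fragment_aware=False) for u in urls})
# -> {'https://docs.example.com/page'}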
Include/Exclude Patterns
Filter pages using glob patterns:
Include Patterns:
https://docs.example.com/api/*
https://docs.example.com/guides/*

Exclude Patterns:
https://docs.example.com/internal/*
https://docs.example.com/*/deprecated
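The exact glob dialect the crawler uses isn't specified here; as an approximation, this sketch uses Python's fnmatch to show how include and exclude patterns combine - a page must match an include pattern (when any are set) and no exclude pattern:

from fnmatch import fnmatch

INCLUDE = ["https://docs.example.com/api/*", "https://docs.example.com/guides/*"]
EXCLUDE = ["https://docs.example.com/internal/*", "https://docs.example.com/*/deprecated"]

def passes_filters(url: str) -> bool:
    # Must match at least one include pattern (if any are configured)
    included = any(fnmatch(url, p) for p in INCLUDE) if INCLUDE else True
    # Must match no exclude pattern
    excluded = any(fnmatch(url, p) for p in EXCLUDE)
    return included and not excluded

print(passes_filters("https://docs.example.com/api/auth"))        # True
print(passes_filters("https://docs.example.com/api/deprecated"))  # False (excluded)
print(passes_filters("https://docs.example.com/blog/post"))       # False (not included)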
Include/Exclude Selectors
Use CSS selectors to filter page content:
Include Only Selectors:
.documentation-content
article.help-article

Exclude Selectors:
.navigation
.footer
.advertisements
When include-only selectors are specified, only matching elements are kept. Nested matches are deduplicated by keeping the outermost element.
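To make the selector behavior concrete, here is a sketch using BeautifulSoup - an assumption for illustration, not the platform's actual extraction pipeline - including the outermost-element deduplication described above:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

EXCLUDE_SELECTORS = [".navigation", ".footer", ".advertisements"]
INCLUDE_ONLY_SELECTORS = [".documentation-content", "article.help-article"]

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop excluded elements entirely
    for selector in EXCLUDE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    # Keep only elements matching include-only selectors; when matches nest,
    # keep just the outermost one (the deduplication described above)
    matches = [n for sel in INCLUDE_ONLY_SELECTORS for n in soup.select(sel)]
    match_ids = {id(n) for n in matches}
    outermost = [n for n in matches
                 if not any(id(p) in match_ids for p in n.parents)]
    return "\n".join(str(n) for n in outermost)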

Web Crawler Example

Crawling a documentation site with specific requirements:
  1. Navigate to Data Providers > Add Data Provider
  2. Select Web Crawler as the provider type
  3. Configure:
    Name: Product Documentation
    Audience: Everyone (or select specific audience)
    
    Seed URLs:
    - https://docs.yourcompany.com
    
    Crawl Scope: Same Domain
    Render Mode: Automatic
    URL Limit: 500
    Concurrency Limit: 10
    
    Include Patterns:
    - https://docs.yourcompany.com/guides/*
    - https://docs.yourcompany.com/api/*
    
    Exclude Patterns:
    - https://docs.yourcompany.com/internal/*
    
    Exclude Selectors:
    - .sidebar
    - .related-articles
    
  4. Save and Sync - The crawler will begin discovering and processing pages
  5. Monitor - View the snapshot timeline to track progress
Start with a small URL limit (50-100 pages) for your first sync to verify the configuration captures the right content, then increase the limit as needed.
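The same provider can also be created programmatically via the Create a Provider endpoint listed under API Integration below. The host, auth scheme, and payload field names in this sketch are illustrative assumptions; consult the API Reference for the exact schema:

import requests

payload = {
    "type": "web_crawler",  # all field names in this payload are assumptions
    "name": "Product Documentation",
    "seed_urls": ["https://docs.yourcompany.com"],
    "crawl_scope": "same_domain",
    "render_mode": "automatic",
    "url_limit": 500,
    "concurrency_limit": 10,
    "include_patterns": ["https://docs.yourcompany.com/guides/*",
                         "https://docs.yourcompany.com/api/*"],
    "exclude_patterns": ["https://docs.yourcompany.com/internal/*"],
    "exclude_selectors": [".sidebar", ".related-articles"],
}
resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id/data-providers",  # hypothetical host
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
    json=payload,
)
resp.raise_for_status()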

Collection

Manually create and manage knowledge snippets for content that doesn’t exist in crawlable sources.

Best for: Internal procedures, FAQs, policy documents, quick knowledge additions

How it works: Create a collection as a container, then add individual snippets of HTML content through the UI or API.

Creating a Collection

  1. Navigate to Data Providers > Add Data Provider
  2. Select Collection as the provider type
  3. Configure:
    Name: Internal Policies
    Audience: Internal Team (optional)
    
  4. Save - Your collection is ready for content

Adding Snippets

After creating a collection, add knowledge snippets:
  1. Navigate to the collection detail page
  2. Click Add Snippet
  3. Provide:
    • Name - Descriptive title for the snippet
    • Content - HTML content (rich text supported)
    • Audience - Optional audience to scope this specific snippet
  4. Save - The snippet is immediately available to your AI
Collections don’t use snapshots or scheduling since content is added manually. Each snippet becomes a source immediately upon creation.
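Snippets can also be added programmatically via the Add Snippet to Collection endpoint listed under API Integration below. A minimal sketch; the host, auth scheme, and payload field names are assumptions:

import requests

resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id"
    "/data-providers/your-collection-id/snippets",  # hypothetical host; path from API Integration
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
    json={
        "name": "PTO Policy",                                            # descriptive title
        "content": "<h2>PTO Policy</h2><p>PTO accrues monthly...</p>",   # HTML content
        "audience": "employees",  # optional; identifier format assumed
    },
)
resp.raise_for_status()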

Collection Example

Creating a collection for internal HR policies:
Name: HR Policies

Snippets:
1. PTO Policy
   Content: <HTML content explaining PTO accrual, request process>
   Audience: Employees

2. Remote Work Guidelines
   Content: <HTML content with remote work requirements>
   Audience: Employees

3. Expense Reimbursement
   Content: <HTML content detailing expense submission process>
   Audience: Everyone

Confluence

Connect to Atlassian Confluence spaces to automatically sync documentation and wiki content.

Best for: Confluence Cloud workspaces, internal documentation wikis, team knowledge bases

Status: Configuration interface coming soon
The Confluence provider is currently in development. Contact support@botbrains.io if you need Confluence integration for your project.

Scheduling Automatic Syncs

Keep your AI’s knowledge fresh by scheduling regular syncs for web crawlers and other automated providers.

Setting a Schedule

When creating or editing a provider:
  1. Locate the Schedule section
  2. Select days to run syncs (Monday through Sunday)
  3. Save - A random time between 1 AM and 5 AM will be assigned automatically
Random sync times prevent all providers from running simultaneously, which helps distribute server load and ensures reliable syncing.

Schedule Examples

Daily Documentation Updates
Days: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Result: Syncs every day at the assigned time
Weekly Knowledge Refresh
Days: Sunday
Result: Syncs once per week on Sunday
Business Days Only
Days: Monday, Tuesday, Wednesday, Thursday, Friday
Result: Syncs on weekdays only

Disabling Schedules

To stop automatic syncs:
  1. Edit the data provider
  2. Deselect all days in the schedule section
  3. Save - No automatic syncs will occur
You can still trigger manual syncs from the provider detail page.

Managing Snapshots

Viewing Snapshots

Each provider’s detail page shows a timeline of snapshots:
  1. Navigate to the data provider
  2. View the snapshot timeline showing sync history
  3. Click a snapshot to expand and view details

Comparing Snapshots

See what changed between syncs:
  1. Select a snapshot from the timeline
  2. View the automatic comparison with the previous snapshot
  3. Review:
    • Added - New sources discovered
    • Modified - Existing sources with content changes
    • Removed - Sources no longer found
The first snapshot for a provider won’t show a comparison since there’s no previous snapshot to compare against.
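Conceptually, the comparison is a set diff over source URLs plus a content check for sources present in both snapshots. A sketch of the idea (not the platform's actual diff implementation):

def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict:
    # Each snapshot is modeled as a mapping of source URL -> content hash
    added = [url for url in current if url not in previous]
    removed = [url for url in previous if url not in current]
    modified = [url for url in current
                if url in previous and current[url] != previous[url]]
    return {"added": added, "modified": modified, "removed": removed}

prev = {"https://docs.example.com/a": "aaa1", "https://docs.example.com/b": "bbb2"}
curr = {"https://docs.example.com/a": "aaa9", "https://docs.example.com/c": "ccc3"}
print(diff_snapshots(prev, curr))
# {'added': ['https://docs.example.com/c'],
#  'modified': ['https://docs.example.com/a'],
#  'removed': ['https://docs.example.com/b']}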

Triggering Manual Syncs

Force a sync outside the scheduled times:
  1. Navigate to the data provider detail page
  2. Click Sync Now
  3. Monitor the new snapshot’s progress in the timeline
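The same action is available through the Trigger a Sync endpoint listed under API Integration below. A minimal sketch; the host and auth scheme are assumptions:

import requests

resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id"
    "/data-providers/your-provider-id/snapshots",  # hypothetical host
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
)
resp.raise_for_status()
print("Snapshot created:", resp.json())  # then monitor its state as described under Snapshots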

Assigning Audiences to Providers

Control which users can access knowledge from a provider by assigning default audiences.

How It Works

When you assign an audience to a data provider:
  • All new sources discovered by that provider automatically inherit the audience assignment
  • Only users matching the audience criteria can see those sources in AI responses
  • Existing sources from previous syncs are not affected - only new sources from future syncs inherit the assignment

Setting a Default Audience

  1. Navigate to Data Providers > Edit Provider
  2. Select an audience from the Audience dropdown
  3. Save - Future syncs will assign this audience to new sources

Use Cases

Customer Documentation
Provider: Public Help Center
Audience: Everyone
Result: All customers see this knowledge
Enterprise Features
Provider: Enterprise Documentation
Audience: Enterprise Customers
Result: Only enterprise tier users see advanced features
Internal Knowledge
Provider: Internal Procedures Wiki
Audience: Internal Team
Result: Only team members access internal documentation
You can override the provider-level audience assignment for individual sources after they’re created. Navigate to the source detail page to change its audience.

Best Practices

Start with Clear Goals

Define what knowledge your AI needs before configuring providers. Focus on high-value content that answers common customer questions.

Test Crawl Configuration

Use small URL limits initially to verify your include/exclude patterns and selectors capture the right content before running full syncs.

Schedule Strategically

Consider how often your source content changes. Documentation sites that update daily need frequent syncs, while stable FAQs might only need weekly updates.

Monitor Snapshot Changes

Regularly review snapshot diffs to catch unintended changes, removed pages, or new content that needs attention.

Use Descriptive Names

Name providers clearly to indicate their purpose and content (e.g., “Product Docs - API Reference” vs. “Provider 1”).

Leverage Audiences

Assign appropriate audiences to providers to ensure users only see relevant knowledge for their segment or access level.

Troubleshooting

Snapshot Stuck in PENDING

Symptoms: Snapshot shows PENDING status for extended periods

Solutions:
  • Check that seed URLs are accessible (not behind authentication)
  • Verify URL patterns aren’t too broad, causing excessive crawling
  • Reduce concurrency limit if the target site is rate-limiting requests
  • Contact support if the issue persists after 30 minutes

Missing Pages in Snapshot

Symptoms: Expected pages aren’t appearing in snapshot sources

Solutions:
  • Verify pages are linked from seed URLs (crawler follows links)
  • Check include/exclude patterns aren’t blocking desired pages
  • Ensure crawl scope allows the page’s domain/hostname
  • Review URL limit - you may need to increase it
  • For JavaScript-heavy sites, try “JavaScript” render mode

Too Many Irrelevant Pages

Symptoms: Snapshot includes unwanted pages or sections

Solutions:
  • Add exclude patterns for unwanted URL paths
  • Use exclude selectors to remove navigation, footers, and ads
  • Tighten crawl scope to “Same Hostname” or “Same Origin”
  • Reduce the URL limit to focus on the most important pages
  • Consider using include patterns to allowlist specific paths

Duplicate Content

Symptoms: Same content appearing multiple times from different URLs

Solutions:
  • Disable “Query Aware” if query parameters don’t change content
  • Disable “Fragment Aware” if fragments don’t represent unique pages
  • Add exclude patterns for known duplicate URL patterns
  • Use include-only selectors to extract just the main content area

API Integration

Data providers can be managed programmatically via the botBrains API:

Create a Provider
POST /api/v1/projects/{project_id}/data-providers
Trigger a Sync
POST /api/v1/projects/{project_id}/data-providers/{provider_id}/snapshots
List Snapshots
GET /api/v1/projects/{project_id}/data-providers/{provider_id}/snapshots
Add Snippet to Collection
POST /api/v1/projects/{project_id}/data-providers/{provider_id}/snippets
For complete API documentation, see the API Reference.

Next Steps

After setting up data providers: