Data providers are the foundation of your AI’s knowledge - they automatically gather, update, and organize information from various sources to ensure your AI has accurate, up-to-date answers for your customers.

Why Use Data Providers?

Data providers eliminate manual knowledge management by automating the process of connecting your AI to existing knowledge sources:
  • Automatic updates - Keep your AI’s knowledge fresh by scheduling regular syncs of documentation sites, wikis, and databases
  • Reduced maintenance - No need to manually copy content into your AI - connect once and let it stay synchronized
  • Version tracking - Review what changed between syncs with snapshot comparisons
  • Audience targeting - Automatically scope knowledge to specific user segments
  • Scalable knowledge - Process thousands of pages from multiple sources efficiently

How Data Providers Work

Provider Lifecycle

  1. Create - Configure a data provider with source details (URLs, credentials, filters)
  2. Sync - The provider fetches content and creates a snapshot of all discovered sources
  3. Process - Content is extracted, cleaned, and made available to your AI
  4. Schedule - Optionally set automatic syncs to keep knowledge current
  5. Monitor - Track sync status, view changes, and manage sources

Snapshots

Each sync creates a snapshot - a point-in-time record of all knowledge sources discovered by the provider. Snapshots let you:
  • Track changes - Compare snapshots to see what content was added, modified, or removed
  • Review before deploying - Examine new sources before making them available to your AI
  • Troubleshoot issues - Investigate when specific content was added or changed
Snapshots progress through three states:
  • PENDING - Sync is in progress, content is being fetched
  • COMPLETED - Sync finished successfully, sources are available
  • FAILED - Sync encountered errors and needs attention
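If you monitor syncs programmatically, you can poll the snapshot list until the state resolves. Below is a minimal sketch in Python using the List Snapshots endpoint shown under API Integration at the end of this page; the host, auth scheme, and response shape are assumptions, not the definitive client:

import time
import requests

BASE = "https://api.botbrains.io"  # hypothetical host - see the API Reference
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption
URL = BASE + "/api/v1/projects/your-project-id/data-providers/your-provider-id/snapshots"

def wait_for_snapshot(snapshot_id: str, poll_seconds: int = 30) -> str:
    # Poll the List Snapshots endpoint until the snapshot leaves PENDING
    # (the response shape used here is an assumption)
    while True:
        snapshots = requests.get(URL, headers=HEADERS).json()
        status = next(s["status"] for s in snapshots if s["id"] == snapshot_id)
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)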

Provider Types

Web Crawler

Automatically crawl and extract content from websites, documentation sites, and help centers.

Best for: Public documentation, help centers, blog articles, product pages

How it works: Starting from seed URLs, the crawler discovers and processes pages according to your scope and filter rules.

Configuration Options

Seed URLs
The starting points for your crawl. These should be HTTPS URLs pointing to your main documentation or help center pages.
https://docs.yourcompany.com
https://help.yourcompany.com/getting-started
Crawl Scope
Controls which pages the crawler can visit based on their relationship to the seed URLs:
  • Same Domain - Crawls all subdomains under the root domain
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api, https://blog.example.com
    • Blocks: https://other-site.com
  • Same Hostname - Only crawls pages on the exact subdomain
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api
    • Blocks: https://blog.example.com, https://docs.example.com:8080/api
  • Same Origin - Strictest setting, matches hostname, protocol, and port exactly
    • Seed: https://docs.example.com
    • Allows: https://docs.example.com/api
    • Blocks: http://docs.example.com/api, https://docs.example.com:8080/api
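These scope rules can be expressed as simple URL comparisons. Here is a sketch of the matching logic in Python, mirroring the examples above - an illustration of the rules, not the crawler's actual implementation (in particular, deriving the root domain properly would require the public-suffix list):

from urllib.parse import urlsplit

def root_domain(hostname: str) -> str:
    # Simplistic: keep the last two labels ("docs.example.com" -> "example.com").
    # A production crawler would consult the public-suffix list instead.
    return ".".join(hostname.split(".")[-2:])

def in_scope(seed: str, candidate: str, scope: str) -> bool:
    s, c = urlsplit(seed), urlsplit(candidate)
    if scope == "same_origin":
        # Protocol, hostname, and port must all match exactly
        return (s.scheme, s.netloc) == (c.scheme, c.netloc)
    if scope == "same_hostname":
        # Exact host (including port) must match
        return s.netloc == c.netloc
    if scope == "same_domain":
        # Any subdomain under the seed's root domain is allowed
        return root_domain(c.hostname or "") == root_domain(s.hostname or "")
    raise ValueError(f"unknown scope: {scope}")

# Mirrors the examples above:
assert in_scope("https://docs.example.com", "https://blog.example.com", "same_domain")
assert not in_scope("https://docs.example.com", "https://docs.example.com:8080/api", "same_hostname")
assert not in_scope("https://docs.example.com", "http://docs.example.com/api", "same_origin")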
Render Mode
Determines how pages are processed:
  • Automatic - Intelligently decides whether JavaScript rendering is needed
  • JavaScript - Fully renders pages with JavaScript (for SPAs and dynamic content)
  • No JavaScript - Fetches raw HTML only (faster for static sites)
URL Limits
Control crawl size and concurrency:
  • URL Limit - Maximum pages to crawl (1-20,000)
  • Concurrency Limit - Simultaneous requests (1-50) - use lower values to avoid overloading websites
URL Variants
Control how the crawler treats URL variations:
  • Query Aware - Treat URLs with different query strings as unique pages
    • When enabled: page?view=list and page?view=grid are different
    • When disabled: Both are treated as the same page
  • Fragment Aware - Crawl URL fragments (anchors after #) as separate pages
    • When enabled: page#section1 and page#section2 are different
    • When disabled: Both are treated as the same page
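Both options amount to URL normalization before deduplication: variants that normalize to the same URL count as one page. A short illustrative sketch:

from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str, query_aware: bool, fragment_aware: bool) -> str:
    p = urlsplit(url)
    query = p.query if query_aware else ""           # keep query only when Query Aware
    fragment = p.fragment if fragment_aware else ""  # keep fragment only when Fragment Aware
    return urlunsplit((p.scheme, p.netloc, p.path, query, fragment))

urls = ["https://docs.example.com/page?view=list",
        "https://docs.example.com/page?view=grid",
        "https://docs.example.com/page#section1"]
# With both options disabled, all three collapse to one page:
print({canonical_url(u, query_aware=False, fragment_aware=False) for u in urls})
# -> {'https://docs.example.com/page'}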
Include/Exclude Patterns
Filter pages using glob patterns:
Include Patterns:
https://docs.example.com/api/*
https://docs.example.com/guides/*

Exclude Patterns:
https://docs.example.com/internal/*
https://docs.example.com/*/deprecated
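The exact glob dialect the crawler uses isn't specified here; as an approximation, this sketch uses Python's fnmatch to show how include and exclude patterns combine - a page must match an include pattern (when any are set) and no exclude pattern:

from fnmatch import fnmatch

INCLUDE = ["https://docs.example.com/api/*", "https://docs.example.com/guides/*"]
EXCLUDE = ["https://docs.example.com/internal/*", "https://docs.example.com/*/deprecated"]

def passes_filters(url: str) -> bool:
    # Must match at least one include pattern (if any are configured)
    included = any(fnmatch(url, p) for p in INCLUDE) if INCLUDE else True
    # Must match no exclude pattern
    excluded = any(fnmatch(url, p) for p in EXCLUDE)
    return included and not excluded

print(passes_filters("https://docs.example.com/api/auth"))        # True
print(passes_filters("https://docs.example.com/api/deprecated"))  # False (excluded)
print(passes_filters("https://docs.example.com/blog/post"))       # False (not included)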
Include/Exclude Selectors
Use CSS selectors to filter page content:
Include Only Selectors:
.documentation-content
article.help-article

Exclude Selectors:
.navigation
.footer
.advertisements
When include-only selectors are specified, only matching elements are kept. Nested matches are deduplicated by keeping the outermost element.
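To make the selector behavior concrete, here is a sketch using BeautifulSoup - an assumption for illustration, not the platform's actual extraction pipeline - including the outermost-element deduplication described above:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

EXCLUDE_SELECTORS = [".navigation", ".footer", ".advertisements"]
INCLUDE_ONLY_SELECTORS = [".documentation-content", "article.help-article"]

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop excluded elements entirely
    for selector in EXCLUDE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    # Keep only elements matching include-only selectors; when matches nest,
    # keep just the outermost one (the deduplication described above)
    matches = [n for sel in INCLUDE_ONLY_SELECTORS for n in soup.select(sel)]
    match_ids = {id(n) for n in matches}
    outermost = [n for n in matches
                 if not any(id(p) in match_ids for p in n.parents)]
    return "\n".join(str(n) for n in outermost)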

Web Crawler Example

Crawling a documentation site with specific requirements:
  1. Navigate to Data Providers > Add Data Provider
  2. Select Web Crawler as the provider type
  3. Configure:
    Name: Product Documentation
    Audience: Everyone (or select specific audience)
    
    Seed URLs:
    - https://docs.yourcompany.com
    
    Crawl Scope: Same Domain
    Render Mode: Automatic
    URL Limit: 500
    Concurrency Limit: 10
    
    Include Patterns:
    - https://docs.yourcompany.com/guides/*
    - https://docs.yourcompany.com/api/*
    
    Exclude Patterns:
    - https://docs.yourcompany.com/internal/*
    
    Exclude Selectors:
    - .sidebar
    - .related-articles
    
  4. Save and Sync - The crawler will begin discovering and processing pages
  5. Monitor - View the snapshot timeline to track progress
Start with a small URL limit (50-100 pages) for your first sync to verify the configuration captures the right content, then increase the limit as needed.
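The same provider can also be created programmatically via the Create a Provider endpoint listed under API Integration below. The host, auth scheme, and payload field names in this sketch are illustrative assumptions; consult the API Reference for the exact schema:

import requests

payload = {
    "type": "web_crawler",  # all field names in this payload are assumptions
    "name": "Product Documentation",
    "seed_urls": ["https://docs.yourcompany.com"],
    "crawl_scope": "same_domain",
    "render_mode": "automatic",
    "url_limit": 500,
    "concurrency_limit": 10,
    "include_patterns": ["https://docs.yourcompany.com/guides/*",
                         "https://docs.yourcompany.com/api/*"],
    "exclude_patterns": ["https://docs.yourcompany.com/internal/*"],
    "exclude_selectors": [".sidebar", ".related-articles"],
}
resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id/data-providers",  # hypothetical host
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
    json=payload,
)
resp.raise_for_status()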

Collection

Manually create and manage knowledge snippets for content that doesn’t exist in crawlable sources.

Best for: Internal procedures, FAQs, policy documents, quick knowledge additions

How it works: Create a collection as a container, then add individual snippets of HTML content through the UI or API.

Creating a Collection

  1. Navigate to Data Providers > Add Data Provider
  2. Select Collection as the provider type
  3. Configure:
    Name: Internal Policies
    Audience: Internal Team (optional)
    
  4. Save - Your collection is ready for content

Adding Snippets

After creating a collection, add knowledge snippets:
  1. Navigate to the collection detail page
  2. Click Add Snippet
  3. Provide:
    • Name - Descriptive title for the snippet
    • Content - HTML content (rich text supported)
    • Audience - Optional audience to scope this specific snippet
  4. Save - The snippet is immediately available to your AI
Collections don’t use snapshots or scheduling since content is added manually. Each snippet becomes a source immediately upon creation.
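Snippets can also be added programmatically via the Add Snippet to Collection endpoint listed under API Integration below. A minimal sketch; the host, auth scheme, and payload field names are assumptions:

import requests

resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id"
    "/data-providers/your-collection-id/snippets",  # hypothetical host; path from API Integration
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
    json={
        "name": "PTO Policy",                                            # descriptive title
        "content": "<h2>PTO Policy</h2><p>PTO accrues monthly...</p>",   # HTML content
        "audience": "employees",  # optional; identifier format assumed
    },
)
resp.raise_for_status()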

Collection Example

Creating a collection for internal HR policies:
Name: HR Policies

Snippets:
1. PTO Policy
   Content: <HTML content explaining PTO accrual, request process>
   Audience: Employees

2. Remote Work Guidelines
   Content: <HTML content with remote work requirements>
   Audience: Employees

3. Expense Reimbursement
   Content: <HTML content detailing expense submission process>
   Audience: Everyone

Confluence

Connect to Atlassian Confluence spaces to automatically sync documentation and wiki content.

Best for: Confluence Cloud workspaces, internal documentation wikis, team knowledge bases

Status: Configuration interface coming soon
The Confluence provider is currently in development. Contact support@botbrains.io if you need Confluence integration for your project.

Scheduling Automatic Syncs

Keep your AI’s knowledge fresh by scheduling regular syncs for web crawlers and other automated providers.

Setting a Schedule

When creating or editing a provider:
  1. Locate the Schedule section
  2. Select days to run syncs (Monday through Sunday)
  3. Save - A random time between 1 AM and 5 AM will be assigned automatically
Random sync times prevent all providers from running simultaneously, which helps distribute server load and ensures reliable syncing.

Schedule Examples

Daily Documentation Updates
Days: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Result: Syncs every day at the assigned time
Weekly Knowledge Refresh
Days: Sunday
Result: Syncs once per week on Sunday
Business Days Only
Days: Monday, Tuesday, Wednesday, Thursday, Friday
Result: Syncs on weekdays only

Disabling Schedules

To stop automatic syncs:
  1. Edit the data provider
  2. Deselect all days in the schedule section
  3. Save - No automatic syncs will occur
You can still trigger manual syncs from the provider detail page.

Managing Snapshots

Viewing Snapshots

Each provider’s detail page shows a timeline of snapshots:
  1. Navigate to the data provider
  2. View the snapshot timeline showing sync history
  3. Click a snapshot to expand and view details

Comparing Snapshots

See what changed between syncs:
  1. Select a snapshot from the timeline
  2. View the automatic comparison with the previous snapshot
  3. Review:
    • Added - New sources discovered
    • Modified - Existing sources with content changes
    • Removed - Sources no longer found
The first snapshot for a provider won’t show a comparison since there’s no previous snapshot to compare against.
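Conceptually, the comparison is a set diff over source URLs plus a content check for sources present in both snapshots. A sketch of the idea (not the platform's actual diff implementation):

def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict:
    # Each snapshot is modeled as a mapping of source URL -> content hash
    added = [url for url in current if url not in previous]
    removed = [url for url in previous if url not in current]
    modified = [url for url in current
                if url in previous and current[url] != previous[url]]
    return {"added": added, "modified": modified, "removed": removed}

prev = {"https://docs.example.com/a": "aaa1", "https://docs.example.com/b": "bbb2"}
curr = {"https://docs.example.com/a": "aaa9", "https://docs.example.com/c": "ccc3"}
print(diff_snapshots(prev, curr))
# {'added': ['https://docs.example.com/c'],
#  'modified': ['https://docs.example.com/a'],
#  'removed': ['https://docs.example.com/b']}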

Triggering Manual Syncs

Force a sync outside the scheduled times:
  1. Navigate to the data provider detail page
  2. Click Sync Now
  3. Monitor the new snapshot’s progress in the timeline
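The same action is available through the Trigger a Sync endpoint listed under API Integration below. A minimal sketch; the host and auth scheme are assumptions:

import requests

resp = requests.post(
    "https://api.botbrains.io/api/v1/projects/your-project-id"
    "/data-providers/your-provider-id/snapshots",  # hypothetical host
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme assumed
)
resp.raise_for_status()
print("Snapshot created:", resp.json())  # then monitor its state as described under Snapshots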

Assigning Audiences to Providers

Control which users can access knowledge from a provider by assigning default audiences.

How It Works

When you assign an audience to a data provider:
  • All new sources discovered by that provider automatically inherit the audience assignment
  • Only users matching the audience criteria can see those sources in AI responses
  • Existing sources from previous syncs are not affected - only new sources from future syncs inherit the assignment

Setting a Default Audience

  1. Navigate to Data Providers > Edit Provider
  2. Select an audience from the Audience dropdown
  3. Save - Future syncs will assign this audience to new sources

Use Cases

Customer Documentation
Provider: Public Help Center
Audience: Everyone
Result: All customers see this knowledge
Enterprise Features
Provider: Enterprise Documentation
Audience: Enterprise Customers
Result: Only enterprise tier users see advanced features
Internal Knowledge
Provider: Internal Procedures Wiki
Audience: Internal Team
Result: Only team members access internal documentation
You can override the provider-level audience assignment for individual sources after they’re created. Navigate to the source detail page to change its audience.

Best Practices

Start with Clear Goals

Define what knowledge your AI needs before configuring providers. Focus on high-value content that answers common customer questions.

Test Crawl Configuration

Use small URL limits initially to verify your include/exclude patterns and selectors capture the right content before running full syncs.

Schedule Strategically

Consider how often your source content changes. Documentation sites that update daily need frequent syncs, while stable FAQs might only need weekly updates.

Monitor Snapshot Changes

Regularly review snapshot diffs to catch unintended changes, removed pages, or new content that needs attention.

Use Descriptive Names

Name providers clearly to indicate their purpose and content (e.g., “Product Docs - API Reference” vs. “Provider 1”).

Leverage Audiences

Assign appropriate audiences to providers to ensure users only see relevant knowledge for their segment or access level.

Troubleshooting

Snapshot Stuck in PENDING

Symptoms: Snapshot shows PENDING status for extended periods

Solutions:
  • Check that seed URLs are accessible (not behind authentication)
  • Verify URL patterns aren’t too broad, causing excessive crawling
  • Reduce concurrency limit if the target site is rate-limiting requests
  • Contact support if the issue persists after 30 minutes

Missing Pages in Snapshot

Symptoms: Expected pages aren’t appearing in snapshot sources

Solutions:
  • Verify pages are linked from seed URLs (crawler follows links)
  • Check include/exclude patterns aren’t blocking desired pages
  • Ensure crawl scope allows the page’s domain/hostname
  • Review URL limit - you may need to increase it
  • For JavaScript-heavy sites, try “JavaScript” render mode

Too Many Irrelevant Pages

Symptoms: Snapshot includes unwanted pages or sections

Solutions:
  • Add exclude patterns for unwanted URL paths
  • Use exclude selectors to remove navigation, footers, and ads
  • Tighten crawl scope to “Same Hostname” or “Same Origin”
  • Reduce the URL limit to focus on the most important pages
  • Consider using include patterns to allowlist specific paths

Duplicate Content

Symptoms: Same content appearing multiple times from different URLs

Solutions:
  • Disable “Query Aware” if query parameters don’t change content
  • Disable “Fragment Aware” if fragments don’t represent unique pages
  • Add exclude patterns for known duplicate URL patterns
  • Use include-only selectors to extract just the main content area

API Integration

Data providers can be managed programmatically via the botBrains API:

Create a Provider
POST /api/v1/projects/{project_id}/data-providers
Trigger a Sync
POST /api/v1/projects/{project_id}/data-providers/{provider_id}/snapshots
List Snapshots
GET /api/v1/projects/{project_id}/data-providers/{provider_id}/snapshots
Add Snippet to Collection
POST /api/v1/projects/{project_id}/data-providers/{provider_id}/snippets
For complete API documentation, see the API Reference.

Next Steps

After setting up data providers: