Testing AI Agents
Testing AI agents is fundamentally different from testing traditional software. Your AI doesn’t follow fixed logic paths - it generates responses dynamically based on instructions, knowledge, and context. Test suites help you verify that changes to guidance, knowledge, or tools improve your AI without breaking existing capabilities.
Why Testing AI Agents Matters
Unlike conventional software where bugs are binary (works or doesn’t), AI behavior exists on a quality spectrum. Testing helps you:
- Maintain quality standards - Ensure responses meet your accuracy and tone requirements
- Prevent regressions - Catch when updates break working functionality
- Build with confidence - Deploy changes knowing they won’t degrade user experience
- Document expected behavior - Test cases serve as living specifications
- Track improvements - Measure progress as you refine guidance and knowledge
- Validate before production - Catch issues in development, not in customer conversations
Think of test suites as quality guardrails. They won’t prevent every issue, but they catch the most common problems before users encounter them.
Understanding Test Suites
A test suite is a collection of test cases that validate specific aspects of your AI’s behavior. You organize test cases into suites based on:
- Feature area - Product questions, billing inquiries, technical support
- User segment - Enterprise customers, free users, trial accounts
- Integration channel - Website chat, Zendesk, Slack, Salesforce
- Release cycle - Pre-deployment checks, regression tests, smoke tests
Anatomy of a Test Suite
Each test suite contains:
Name and Description
- Clear label indicating what you’re testing
- Description explaining the suite’s purpose
- Examples: “Sales Inquiries”, “Enterprise Customer Support”, “Product Knowledge”
Test Cases
- Individual scenarios you want to validate
- Each case tests one specific behavior or capability
- Cases can have conversation history for context
Test Runs
- Executions of all test cases in the suite
- Creates a snapshot of pass/fail results
- Tracks performance over time
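If it helps to picture how these pieces fit together, the sketch below models a suite, its cases, and a run as plain Python data structures. The class and field names are illustrative only, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestCase:
    """One scenario to validate: an input conversation plus a check."""
    name: str
    conversation: list[dict]           # e.g. [{"role": "user", "content": "..."}]
    check_type: str                    # "semantic", "matches", "not_matches", "classify"
    expected: str
    threshold: Optional[float] = None  # only used by semantic checks
    description: str = ""

@dataclass
class TestSuite:
    """A named collection of test cases covering one feature area or segment."""
    name: str
    description: str
    cases: list[TestCase] = field(default_factory=list)

@dataclass
class TestRun:
    """A snapshot of executing every case in a suite against one AI version."""
    suite_name: str
    run_name: str
    results: dict[str, bool] = field(default_factory=dict)  # case name -> passed?

    @property
    def pass_rate(self) -> float:
        return 100 * sum(self.results.values()) / max(len(self.results), 1)
```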
Creating Test Suites
Navigate to Test Suites in the platform sidebar to manage your tests.
Creating Your First Suite
- Click “Create Your First Test Suite” (or “New Test Suite” if you have existing suites)
- Enter a descriptive name
- Good: “Product Knowledge - Pricing Questions”
- Avoid: “Test Suite 1”
- Write a clear description
- Explain what behavior this suite validates
- Include context about which guidance or knowledge it covers
- Save the suite
Start with a small suite (5-10 test cases) focused on your most critical user interactions. Expand as you identify more patterns to test.
Organizing Multiple Suites
As your testing grows, organize suites by:
Critical User Paths
- Product Questions (General Users)
- Billing and Subscription Support
- Technical Troubleshooting
- Account Management
- Onboarding Flow
Customer Segments
- Enterprise Customer Interactions
- Free Tier Support
- Trial User Engagement
- Premium Support Cases
Integrations
- Website Chat Behavior
- Zendesk Agent Assist
- Slack Internal Support
- Salesforce Case Deflection
Writing Effective Test Cases
Test cases define the scenarios your AI should handle correctly. Each case specifies input (user messages) and expected output (evaluation criteria).
Creating a Test Case
Navigate to a test suite and click “Add Test Case”:
Name
- Describe the scenario being tested
- Be specific about what you’re validating
- Example: “Explain monthly pricing with discount code”
Description (optional)
- Add context about why this test matters
- Document edge cases or special conditions
- Link to related conversations or issues
Conversation History
- Build the context leading to the test
- Add alternating user and assistant messages (user → assistant → user)
- Final user message is what gets evaluated
Check Configuration
- Define how to evaluate the response
- Choose the appropriate check type
- Set thresholds and expected values
Crafting User Queries
The conversation input should represent realistic user interactions:
Single Turn Tests
User: "What's the difference between your Pro and Enterprise plans?"
Test single-question scenarios - the most common pattern.
Multi-Turn Context
User: "I'm interested in your product"
Assistant: "Great! What's your main use case?"
User: "I need to support 1000 customers per month"
Assistant: "For that volume, I'd recommend our Business or Enterprise plan."
User: "What's the pricing difference?"
Test context-dependent responses where history matters.
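If you script or export test cases, the multi-turn example above can be thought of as an ordered list of role-tagged messages, where only the final user turn is evaluated. The structure below is a hypothetical representation, not a botBrains export format.

```python
# Hypothetical representation of the multi-turn test above: the history gives
# the AI context, and only the final user message is evaluated by the check.
conversation = [
    {"role": "user", "content": "I'm interested in your product"},
    {"role": "assistant", "content": "Great! What's your main use case?"},
    {"role": "user", "content": "I need to support 1000 customers per month"},
    {"role": "assistant", "content": "For that volume, I'd recommend our Business or Enterprise plan."},
    {"role": "user", "content": "What's the pricing difference?"},  # the evaluated turn
]
```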
Edge Cases
User: "Do you offer student discounts for non-profit organizations in Europe?"
Test complex scenarios combining multiple conditions.
Ambiguous Queries
Test how AI handles vague questions (should ask clarifying questions).
Avoid testing random or nonsensical inputs. Focus on realistic user questions that represent actual patterns from your conversations.
Choosing Check Types
botBrains offers four evaluation methods:
Semantically Equivalent
Use when: You want the AI to convey specific information, but exact wording doesn’t matter.
Configuration:
- Expected Value: The core message the response should contain
- Threshold: Similarity percentage (60-100%)
Example:
Expected: "Enterprise plan includes priority support, dedicated account manager,
and custom SLA. Pricing starts at $500/month."
Threshold: 75%
✓ Pass: "Our Enterprise tier gives you priority support with a dedicated account
manager and customized SLA. Plans begin at $500 monthly."
✗ Fail: "Enterprise is our premium option with lots of benefits. Contact sales
for pricing."
Best for:
- Knowledge accuracy tests
- Factual information validation
- Concept explanation checks
Start with 70% threshold and adjust based on results. Higher thresholds (85%+) require very close matches. Lower thresholds (60-70%) allow more variation in wording.
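Conceptually, a semantic equivalence check compares an embedding of the response against an embedding of the expected value and passes when their similarity clears the threshold. The sketch below assumes a generic embed() function and cosine similarity; botBrains' actual scoring method may differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_check(response: str, expected: str, threshold: float, embed) -> bool:
    """Pass if the response is at least `threshold` percent similar to the expected value.

    `embed` is a stand-in for whatever embedding model the evaluator uses;
    the real scoring pipeline is not documented here.
    """
    similarity = cosine_similarity(embed(response), embed(expected))
    return similarity * 100 >= threshold
```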
Matches Pattern
Use when: The response must contain specific keywords, phrases, or formatting.
Configuration:
- Pattern: Regular expression to match against response
Examples:
Pattern: \$\d+/month
Use: Verify pricing is mentioned with dollar amount and /month
Pattern: (?i)(enterprise|business|pro) plan
Use: Ensure specific plan tiers are referenced
Pattern: ^(I understand|I see|Got it)
Use: Validate response starts with acknowledgment
Best for:
- Format validation (dates, prices, codes)
- Required phrase inclusion
- Structured response checks
- Compliance language verification
Not Matches Pattern
Use when: The response should NOT contain certain content.
Configuration:
- Pattern: Regular expression that should NOT appear in response
Examples:
Pattern: (?i)(I don't know|I'm not sure|I can't help)
Use: Ensure AI doesn't express uncertainty inappropriately
Pattern: \[TODO\]|PLACEHOLDER
Use: Verify no template text appears in responses
Pattern: (?i)confidential|internal only
Use: Prevent leakage of internal information
Best for:
- Prohibited content detection
- Tone/voice violations
- Information security checks
- Avoiding specific phrases
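Both pattern checks reduce to a regular-expression search over the response: Matches Pattern passes when the pattern is found, Not Matches Pattern passes when it is not. A minimal Python equivalent, using Python's re syntax (which may differ slightly from the platform's regex flavor):

```python
import re

def matches_pattern(response: str, pattern: str) -> bool:
    """Pass if the pattern appears anywhere in the response."""
    return re.search(pattern, response) is not None

def not_matches_pattern(response: str, pattern: str) -> bool:
    """Pass only if the pattern does NOT appear in the response."""
    return re.search(pattern, response) is None

# Using the example patterns from above:
assert matches_pattern("The Pro plan is $49/month.", r"\$\d+/month")
assert not_matches_pattern("Happy to help with that!",
                           r"(?i)(I don't know|I'm not sure|I can't help)")
```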
Classify As
Use when: The response should be categorized into predefined classes.
Configuration:
- Options: List of possible classifications (comma-separated)
- Expected: Which classifications should apply (can be multiple)
Examples:
Options: helpful, neutral, unhelpful
Expected: helpful
Use: Validate response helpfulness
Options: sales, support, product info, account management
Expected: product info
Use: Check topic categorization
Options: empathetic, professional, casual, robotic
Expected: empathetic, professional
Use: Assess tone appropriateness
Best for:
- Tone and sentiment validation
- Topic classification
- Multi-dimensional quality checks
- Intent verification
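The pass/fail logic for a Classify As check is simple set containment: every expected label must be among the labels the evaluator assigns (typically an LLM judge prompted with the option list). A minimal sketch of just that logic, with hypothetical label sets:

```python
def classify_check(predicted: set[str], expected: set[str]) -> bool:
    """Pass if every expected label was assigned to the response.

    How labels get predicted is up to the evaluator; this sketch only
    shows the pass/fail comparison.
    """
    return expected.issubset(predicted)

# Options: empathetic, professional, casual, robotic
# Expected: empathetic, professional
assert classify_check(predicted={"empathetic", "professional"},
                      expected={"empathetic", "professional"})
assert not classify_check(predicted={"casual"},
                          expected={"empathetic", "professional"})
```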
Test Case Best Practices
Focus on User Value
✓ Good: "User asks about Enterprise features - AI lists key differentiators"
✗ Avoid: "Test if GPT-4 embedding similarity > 0.85"
Write from user perspective, not technical implementation.
One Behavior Per Test
✓ Good: Separate tests for "pricing accuracy" and "tone appropriateness"
✗ Avoid: One test checking pricing, tone, format, and tool usage
Specific tests make failures easier to diagnose.
Use Realistic Conversations
Pull test cases from actual user interactions in Analyze → Conversations. Real questions beat invented scenarios.
Cover Edge Cases
- Misspellings: "enterprize plan"
- Vague requests: "I need help"
- Out of scope: "What's the weather?"
- Multi-part: "Tell me about pricing and also can I get a demo?"
Test Tool Usage
Include cases where AI should (and shouldn’t) use specific tools like search or escalation.
Running Tests
Once you have test cases, run the suite to evaluate current AI behavior.
Starting a Test Run
- Navigate to your test suite
- Click “Run Test Suite” in the test runs section
- Enter a name for this run
- Include version or purpose: “v0.5 pre-deploy”, “Post pricing update”
- Add description (optional)
- Note what changed since last run
- Document what you’re validating
- Click “Create Test Run”
The platform executes each test case against your current AI configuration:
- Sends the user message(s) to the AI
- Collects the AI’s response
- Evaluates against the check criteria
- Records pass/fail for each test
Test runs execute against your currently deployed AI behavior. Make sure you’ve built and deployed the version you want to test.
Test Execution
During execution:
- Status updates show progress through test cases
- Real-time results appear as tests complete
- Polling continues until all tests finish
- Duration varies based on suite size (typically 30-120 seconds)
Interpreting Results
Test run results show:
Overall Statistics
- Total test cases
- Passed count
- Failed count
- Pass rate percentage
Individual Test Results
For each test case:
- ✓ Passed - AI response met criteria
- ✗ Failed - AI response didn’t meet criteria
- Actual Response - What the AI said
- Evaluation Details - Why it passed or failed
Reviewing Failures
Click on a failed test to see:
- Input conversation - What was sent to the AI
- Expected outcome - What the check looked for
- Actual response - What the AI produced
- Evaluation reasoning - Why it didn’t match
Common Failure Patterns
Threshold Too High (Semantic Equivalence)
Expected: "Enterprise plan costs $500/month"
Actual: "Enterprise pricing is $500 monthly"
Threshold: 95%
Result: Failed at 88% similarity
Fix: Lower threshold to 85% or make expected value less specific
Pattern Too Strict (Regex Matches)
Pattern: \$500/month
Actual: "The cost is $500 per month"
Result: Failed (per month ≠ /month)
Fix: Update pattern to \$500.?(per month|/month|monthly)
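Before committing a broadened pattern, it's worth sanity-checking it against the phrasings you've actually seen. A quick check using Python's re module (whose syntax may differ slightly from the platform's regex flavor):

```python
import re

# Verify the broadened pattern accepts the real-world variants that failed before.
pattern = r"\$500.?(per month|/month|monthly)"
for text in ["$500/month", "The cost is $500 per month", "Plans start at $500 monthly"]:
    assert re.search(pattern, text), f"pattern missed: {text}"
```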
Knowledge Gap
Expected: "Pro plan includes 50 seats"
Actual: "Pro plan is great for growing teams"
Result: Failed - missing specific information
Fix: Add seat information to knowledge base or guidance
Wrong Guidance Applied
Expected: Professional, concise answer
Actual: Casual, verbose response
Result: Failed tone check
Fix: Check audience filters - may be using wrong guidance rule
Test-Driven AI Development
Adopt a workflow that uses tests to guide improvements:
The TDD Cycle for AI
1. Write Test for Desired Behavior
Example: "User asks about mobile app availability"
Expected: "Our mobile app is available on iOS and Android.
Download from the App Store or Google Play."
2. Run Test (Expect Failure)
Result: Failed - AI says "I don't have information about mobile apps"
Reason: Knowledge gap
3. Make Minimum Change
Action: Add snippet about mobile app availability
Content: Platform supports mobile apps on iOS and Android,
available in respective app stores
4. Rebuild and Deploy
Build new version with updated knowledge
Deploy to testing environment
5. Run Test Again
Result: Passed - AI now provides correct information
6. Add Regression Test
Keep the test in suite to prevent future breaks
Progressive Test Coverage
Build your test suite incrementally:
Week 1: Core Functionality (10-15 tests)
- Most common user questions
- Critical product information
- Key workflows (signup, pricing, support)
Week 2: Edge Cases (5-10 tests)
- Unusual but important scenarios
- Multi-step interactions
- Context-dependent responses
Week 3: Tone and Quality (5-10 tests)
- Brand voice compliance
- Empathy in support scenarios
- Professional language
Week 4: Tool Usage (5-10 tests)
- When to search vs. use knowledge
- Escalation triggers
- Web fetch appropriateness
Ongoing: Regression Prevention
- Add test for every bug fix
- Cover new features as you launch
- Update tests when requirements change
Integrating with CI/CD
Use test suites as quality gates in your deployment pipeline.
Pre-Deployment Checklist
Before deploying a new AI version:
- Run all test suites against the new version
- Review failures - Are they expected changes or bugs?
- Update tests if requirements legitimately changed
- Fix issues if tests caught real problems
- Re-run until acceptable pass rate achieved
- Deploy with confidence
Recommended Pass Thresholds
Critical User Paths: 100% pass rate
- Signup flow, payment questions, account access
- Zero tolerance for failures in critical areas
General Support: 90%+ pass rate
- Minor variations acceptable
- Review failures to identify patterns
Experimental Features: 70%+ pass rate
- Early-stage capabilities
- Tests help guide refinement
Never deploy if critical path tests fail. A broken signup flow or payment process damages trust and revenue.
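As an illustration of these thresholds acting as a deployment gate, the sketch below checks per-category pass rates before allowing a deploy. The category names, thresholds, and data structures are hypothetical, not a botBrains API.

```python
# Minimum pass rates per suite category, mirroring the recommendations above.
PASS_THRESHOLDS = {
    "critical": 100.0,
    "general": 90.0,
    "experimental": 70.0,
}

def can_deploy(pass_rates: dict[str, float]) -> bool:
    """Return True only if every suite category meets its minimum pass rate."""
    return all(
        pass_rates.get(category, 0.0) >= minimum
        for category, minimum in PASS_THRESHOLDS.items()
    )

print(can_deploy({"critical": 100.0, "general": 92.5, "experimental": 71.0}))  # True
print(can_deploy({"critical": 96.0, "general": 95.0, "experimental": 90.0}))   # False
```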
Automation Opportunities
While botBrains doesn’t currently offer API-triggered test runs, you can:
- Schedule manual runs before each deployment window
- Document test results in deployment notes
- Track pass rates over time in a spreadsheet
- Alert team when rates drop below threshold
Future API support will enable fully automated CI/CD integration.
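Until then, a lightweight local script can stand in for scheduled tracking. A minimal sketch that logs manual run results to a CSV and flags drops below a chosen alert level (the file name and threshold are placeholders):

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("test_run_history.csv")  # placeholder log location
ALERT_THRESHOLD = 90.0              # placeholder alert level; set per suite

def log_run(suite: str, version: str, passed: int, total: int) -> None:
    """Append a manual test run to the CSV log and flag drops below the threshold."""
    pass_rate = 100 * passed / total
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "suite", "version", "passed", "total", "pass_rate"])
        writer.writerow([date.today().isoformat(), suite, version,
                         passed, total, f"{pass_rate:.1f}"])
    if pass_rate < ALERT_THRESHOLD:
        print(f"ALERT: {suite} pass rate dropped to {pass_rate:.1f}% on {version}")

log_run("Product Knowledge - Pricing", "v0.5 pre-deploy", passed=18, total=20)
```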
Best Practices
Coverage Strategies
Start with Happy Paths
Test the ideal user journey first:
1. User asks product question → AI provides accurate answer
2. User requests pricing → AI explains tiers clearly
3. User needs help → AI searches knowledge, then offers escalation
Add Sad Paths
Test error handling and edge cases:
1. User asks off-topic question → AI politely declines
2. User provides vague request → AI asks clarifying questions
3. AI doesn't find answer → AI offers escalation appropriately
Test Boundaries
Verify your AI stays in scope:
1. Request outside expertise → AI sets appropriate expectations
2. Confidential information request → AI refuses safely
3. Multiple questions → AI addresses all parts
Regression Test Creation
When to Add Regression Tests
Add a test whenever:
- User reports incorrect information
- Escalation should have happened but didn’t
- Response tone was inappropriate
- Tool was used incorrectly
- Edge case wasn’t handled
Regression Test Template
Name: [Bug ID] - [Brief description]
Description: Regression test for [issue]. Previously AI [wrong behavior],
should now [correct behavior].
Input: [Exact user message that triggered bug]
Check: [Verification that bug is fixed]
Example:
Name: BUG-123 - Student discount eligibility
Description: Regression test for student discount confusion. Previously AI
said students don't qualify, should clarify non-profit org discounts are
separate from education pricing.
Input: "Do you offer student discounts for non-profit organizations?"
Check: Semantically Equivalent
Expected: "We have separate programs: student discounts for individual
students in education, and non-profit organization discounts for registered
charities. Which applies to your situation?"
Threshold: 75%
Maintaining Test Suites
Regular Maintenance Tasks
Monthly: Review and Update
- Remove obsolete tests (deprecated features)
- Update expected values for intentional changes
- Add tests for new features
- Adjust thresholds based on performance
After Major Changes
- Guidance overhaul → Update all tone/style tests
- Knowledge migration → Update semantic equivalence tests
- Tool changes → Update tool usage tests
- Audience changes → Add segment-specific tests
When Tests Become Noise
Remove tests that:
- Fail inconsistently without clear pattern
- Test implementation details instead of user value
- Duplicate coverage of other tests
- Apply to removed features
A smaller suite with high-signal tests beats a large suite with noise. Quality over quantity.
Test Data Management
Sensitive Information
Never include in test cases:
- Real customer names, emails, or PII
- Actual account numbers or IDs
- Confidential business information
- Internal system details
Use placeholder data:
✗ Avoid: "What's the status of order #123456 for john.smith@company.com?"
✓ Better: "What's the status of my order?"
Realistic but Generic
✓ "I'm on the Pro plan and need to upgrade to Enterprise for my team of 50"
✓ "Our company is evaluating your product for customer support automation"
✓ "I'm getting an error when trying to connect my Zendesk account"
Common Testing Patterns
Pattern 1: Knowledge Accuracy Suite
Goal: Verify factual information is correct
Structure:
Test Suite: Product Knowledge - Pricing
├─ Monthly pricing for Pro plan
├─ Annual pricing with discount
├─ Enterprise custom pricing message
├─ Free tier limitations
└─ Trial period duration
Check Type: Semantically Equivalent (80% threshold)
When to Run: After any knowledge updates, before deployment
Pattern 2: Tone Consistency Suite
Goal: Ensure brand voice across scenarios
Structure:
Test Suite: Brand Voice - Professional & Empathetic
├─ Response to frustrated user
├─ Response to confused user
├─ Response to enthusiastic user
├─ Response to detailed technical question
└─ Response to simple question
Check Type: Classify As (tone classifications)
When to Run: After guidance changes, monthly quality check
Pattern 3: Tool Usage Suite
Goal: Validate AI uses tools appropriately
Structure:
Test Suite: Tool Usage - Search & Escalation
├─ Should search knowledge base for documented feature
├─ Should NOT search for basic greeting
├─ Should escalate complex technical issue
├─ Should NOT escalate simple password reset
└─ Should offer handoff after failed resolution
Check Type: Mix of Matches and Classify As
When to Run: After tool configuration changes
Pattern 4: Multi-Turn Context Suite
Goal: Test context retention across conversation
Structure:
Test Suite: Context Handling
├─ Follow-up question references previous answer
├─ Pronoun resolution ("it", "that", "them")
├─ Topic switch mid-conversation
├─ Returning to previous topic
└─ Contradictory information handling
Check Type: Semantically Equivalent
When to Run: After model updates, monthly
Pattern 5: Segment-Specific Suite
Goal: Validate audience-targeted behavior
Structure:
Test Suite: Enterprise Customer Experience
├─ Enterprise user gets priority escalation offer
├─ Enterprise user sees advanced features
├─ Enterprise user gets dedicated support mention
├─ Free user doesn't get enterprise messaging
└─ Free user sees appropriate upgrade path
Check Type: Mix of Semantically Equivalent and Classify As
When to Run: After audience/guidance changes
Troubleshooting Test Failures
Inconsistent Results
Problem: Same test passes sometimes, fails others
Common Causes:
- Threshold too close to borderline (e.g., 75% when responses vary 70-80%)
- AI response has acceptable variation in wording
- Check criteria too strict for creative responses
Solutions:
1. Lower semantic threshold by 5-10%
2. Use pattern matching for must-have phrases, semantic for concepts
3. Accept that some variation is normal - update expected value to be more general
4. If truly random, this may indicate guidance needs to be more specific
Unexpected Passes
Problem: Test passes but manual review shows poor quality
Common Causes:
- Expected value too vague
- Threshold too low
- Check type doesn’t match what you’re validating
Solutions:
1. Make expected value more specific and detailed
2. Increase semantic threshold
3. Switch check type (e.g., semantic → pattern match for exact phrases)
4. Add multiple checks - one for content, one for tone
All Tests Failing
Problem: Entire suite suddenly fails
Common Causes:
- Wrong AI version deployed
- Guidance deactivated
- Required tools disabled
- Knowledge source removed
Solutions:
1. Check deployed version - is it what you intended to test?
2. Verify guidance rules are active
3. Confirm tools are enabled
4. Review knowledge sources and data providers
5. Check for recent configuration changes
False Negatives
Problem: Test fails but response is actually good
Common Causes:
- Expected value doesn’t match how AI naturally phrases things
- Threshold too high
- Pattern too specific
Solutions:
1. Update expected value to match AI's phrasing style
2. Reduce threshold to 70-80% for concept matching
3. Broaden regex pattern to accept variations
4. Consider if test expectations are realistic
Measuring Test Suite Health
Track these metrics to ensure your testing remains effective:
Coverage Metrics
Critical Path Coverage
(Test cases covering critical scenarios / Total critical scenarios) × 100
Target: 100%
Feature Coverage
(Features with test cases / Total features) × 100
Target: 80%+
Tool Coverage
(Tools with usage tests / Total tools enabled) × 100
Target: 100%
Quality Metrics
Pass Rate Trend
Track over time - should remain stable or improve
Declining pass rate indicates quality issues or outdated tests
False Positive Rate
(Tests passing that should fail / Total tests) × 100
Target: <5%
False Negative Rate
(Tests failing that should pass / Total tests) × 100
Target: <10%
Test Maintenance Frequency
Tests updated per deployment
Target: Review 10-20% of tests per major deployment
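All of these metrics are simple ratios, so they're easy to compute from a spreadsheet export or a short script. An illustrative sketch with made-up numbers:

```python
def coverage(covered: int, total: int) -> float:
    """Coverage metrics: (items with tests / total items) x 100."""
    return 100 * covered / total if total else 0.0

def false_rate(mislabeled: int, total_tests: int) -> float:
    """False positive/negative rate: mislabeled outcomes over total tests, x 100."""
    return 100 * mislabeled / total_tests if total_tests else 0.0

print(coverage(covered=12, total=12))            # critical path coverage -> 100.0
print(coverage(covered=16, total=20))            # feature coverage -> 80.0
print(false_rate(mislabeled=1, total_tests=40))  # false negative rate -> 2.5
```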
Next Steps
Now that you understand AI testing, remember that testing is continuous. As your AI evolves and user needs change, your test suite should evolve too. The investment in testing pays dividends in quality, reliability, and peace of mind.