Testing AI Agents

Testing AI agents is fundamentally different from testing traditional software. Your AI doesn’t follow fixed logic paths - it generates responses dynamically based on instructions, knowledge, and context. Test suites help you verify that changes to guidance, knowledge, or tools improve your AI without breaking existing capabilities.

Why Testing AI Agents Matters

Unlike conventional software where bugs are binary (works or doesn’t), AI behavior exists on a quality spectrum. Testing helps you:
  • Maintain quality standards - Ensure responses meet your accuracy and tone requirements
  • Prevent regressions - Catch when updates break working functionality
  • Build with confidence - Deploy changes knowing they won’t degrade user experience
  • Document expected behavior - Test cases serve as living specifications
  • Track improvements - Measure progress as you refine guidance and knowledge
  • Validate before production - Catch issues in development, not in customer conversations
Think of test suites as quality guardrails. They won’t prevent every issue, but they catch the most common problems before users encounter them.

Understanding Test Suites

A test suite is a collection of test cases that validate specific aspects of your AI’s behavior. You organize test cases into suites based on:
  • Feature area - Product questions, billing inquiries, technical support
  • User segment - Enterprise customers, free users, trial accounts
  • Integration channel - Website chat, Zendesk, Slack, Salesforce
  • Release cycle - Pre-deployment checks, regression tests, smoke tests

Anatomy of a Test Suite

Each test suite contains:
Name and Description
  • Clear label indicating what you’re testing
  • Description explaining the suite’s purpose
  • Examples: “Sales Inquiries”, “Enterprise Customer Support”, “Product Knowledge”
Test Cases
  • Individual scenarios you want to validate
  • Each case tests one specific behavior or capability
  • Cases can have conversation history for context
Test Runs
  • Executions of all test cases in the suite
  • Creates a snapshot of pass/fail results
  • Tracks performance over time

Creating Test Suites

Navigate to Test Suites in the platform sidebar to manage your tests.

Creating Your First Suite

  1. Click “Create Your First Test Suite” (or “New Test Suite” if you have existing suites)
  2. Enter a descriptive name
    • Good: “Product Knowledge - Pricing Questions”
    • Avoid: “Test Suite 1”
  3. Write a clear description
    • Explain what behavior this suite validates
    • Include context about which guidance or knowledge it covers
  4. Save the suite
Start with a small suite (5-10 test cases) focused on your most critical user interactions. Expand as you identify more patterns to test.

Organizing Multiple Suites

As your testing grows, organize suites by:
Critical User Paths
- Product Questions (General Users)
- Billing and Subscription Support
- Technical Troubleshooting
- Account Management
- Onboarding Flow
Customer Segments
- Enterprise Customer Interactions
- Free Tier Support
- Trial User Engagement
- Premium Support Cases
Integrations
- Website Chat Behavior
- Zendesk Agent Assist
- Slack Internal Support
- Salesforce Case Deflection

Writing Effective Test Cases

Test cases define the scenarios your AI should handle correctly. Each case specifies input (user messages) and expected output (evaluation criteria).

Creating a Test Case

Navigate to a test suite and click “Add Test Case”:
Name
  • Describe the scenario being tested
  • Be specific about what you’re validating
  • Example: “Explain monthly pricing with discount code”
Description (optional)
  • Add context about why this test matters
  • Document edge cases or special conditions
  • Link to related conversations or issues
Conversation History
  • Build the context leading to the test
  • Add message pairs (user → assistant → user)
  • Final user message is what gets evaluated
Check Configuration
  • Define how to evaluate the response
  • Choose the appropriate check type
  • Set thresholds and expected values
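Before building cases in the UI, it can help to sketch them in a structured form. The snippet below is a minimal, hypothetical representation of the fields above (botBrains itself is configured through the platform, not through code); the class and field names are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Message:
    role: str       # "user" or "assistant"
    content: str

@dataclass
class TestCase:
    name: str                       # scenario being tested
    check_type: str                 # e.g. "semantically_equivalent"
    expected: str                   # expected value or regex pattern
    threshold: float | None = None  # only used by semantic checks
    description: str = ""           # why this test matters
    history: list[Message] = field(default_factory=list)  # final user message is evaluated

case = TestCase(
    name="Explain monthly pricing with discount code",
    check_type="semantically_equivalent",
    expected="Pro pricing is billed monthly; discount codes are applied at checkout.",  # illustrative wording
    threshold=0.75,
    history=[Message("user", "How much is the Pro plan per month if I have a discount code?")],
)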

Crafting User Queries

The conversation input should represent realistic user interactions:
Single Turn Tests
User: "What's the difference between your Pro and Enterprise plans?"
Test single-question scenarios - most common pattern.
Multi-Turn Context
User: "I'm interested in your product"
Assistant: "Great! What's your main use case?"
User: "I need to support 1000 customers per month"
Assistant: "For that volume, I'd recommend our Business or Enterprise plan."
User: "What's the pricing difference?"
Test context-dependent responses where history matters.
Edge Cases
User: "Do you offer student discounts for non-profit organizations in Europe?"
Test complex scenarios combining multiple conditions.
Ambiguous Queries
User: "It's not working"
Test how AI handles vague questions (should ask clarifying questions).
Avoid testing random or nonsensical inputs. Focus on realistic user questions that represent actual patterns from your conversations.

Choosing Check Types

botBrains offers four evaluation methods:

Semantically Equivalent

Use when: You want the AI to convey specific information, but exact wording doesn’t matter.
Configuration:
  • Expected Value: The core message the response should contain
  • Threshold: Similarity percentage (60-100%)
Example:
Expected: "Enterprise plan includes priority support, dedicated account manager,
and custom SLA. Pricing starts at $500/month."
Threshold: 75%

✓ Pass: "Our Enterprise tier gives you priority support with a dedicated account
manager and customized SLA. Plans begin at $500 monthly."

✗ Fail: "Enterprise is our premium option with lots of benefits. Contact sales
for pricing."
Best for:
  • Knowledge accuracy tests
  • Factual information validation
  • Concept explanation checks
Start with a 70% threshold and adjust based on results. Higher thresholds (85%+) require very close matches. Lower thresholds (60-70%) allow more variation in wording.
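How botBrains scores similarity internally isn’t specified here, but the general idea behind semantic checks can be sketched as cosine similarity between sentence embeddings. In the sketch below, embed stands in for whatever embedding function you assume, and 0.75 mirrors the 75% threshold in the example above.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantically_equivalent(response: str, expected: str, threshold: float, embed) -> bool:
    # embed() is assumed to return a vector for a piece of text
    return cosine(embed(response), embed(expected)) >= threshold

# semantically_equivalent(actual_response, expected_text, 0.75, embed=my_model.encode)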

Matches Pattern

Use when: The response must contain specific keywords, phrases, or formatting.
Configuration:
  • Pattern: Regular expression to match against response
Examples:
Pattern: \$\d+/month
Use: Verify pricing is mentioned with dollar amount and /month

Pattern: (?i)(enterprise|business|pro) plan
Use: Ensure specific plan tiers are referenced

Pattern: ^(I understand|I see|Got it)
Use: Validate response starts with acknowledgment
Best for:
  • Format validation (dates, prices, codes)
  • Required phrase inclusion
  • Structured response checks
  • Compliance language verification
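Because these checks are plain regular expressions, you can sanity-check a pattern against a sample response before saving the test case. A quick sketch (the sample response is invented for illustration):

import re

sample = "The Enterprise plan starts at $500/month."

assert re.search(r"\$\d+/month", sample)                         # price with /month
assert re.search(r"(?i)(enterprise|business|pro) plan", sample)  # plan tier named
# Anchored patterns such as ^(I understand|I see|Got it) only match
# when the response begins with one of the acknowledgments.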

Not Matches Pattern

Use when: The response should NOT contain certain content.
Configuration:
  • Pattern: Regular expression that should NOT appear in response
Examples:
Pattern: (?i)(I don't know|I'm not sure|I can't help)
Use: Ensure AI doesn't express uncertainty inappropriately

Pattern: \[TODO\]|PLACEHOLDER
Use: Verify no template text appears in responses

Pattern: (?i)confidential|internal only
Use: Prevent leakage of internal information
Best for:
  • Prohibited content detection
  • Tone/voice violations
  • Information security checks
  • Avoiding specific phrases
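A quick way to think about this check: the test passes only when the pattern finds nothing in the response. Sketched with the uncertainty pattern from the examples above:

import re

def not_matches(response: str, pattern: str) -> bool:
    return re.search(pattern, response) is None

not_matches("Happy to help with that!", r"(?i)(I don't know|I'm not sure|I can't help)")  # True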

Classify As

Use when: The response should be categorized into predefined classes.
Configuration:
  • Options: List of possible classifications (comma-separated)
  • Expected: Which classifications should apply (can be multiple)
Examples:
Options: helpful, neutral, unhelpful
Expected: helpful
Use: Validate response helpfulness

Options: sales, support, product info, account management
Expected: product info
Use: Check topic categorization

Options: empathetic, professional, casual, robotic
Expected: empathetic, professional
Use: Assess tone appropriateness
Best for:
  • Tone and sentiment validation
  • Topic classification
  • Multi-dimensional quality checks
  • Intent verification
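Classification checks are typically scored by a judge model rather than string matching. The sketch below assumes a classify function you provide (for example, an LLM prompt that returns the applicable labels); botBrains performs this step internally, so the mechanism shown is an assumption.

def classify_as(response: str, options: list[str], expected: list[str], classify) -> bool:
    labels = classify(response, options)        # e.g. returns {"empathetic", "professional"}
    return set(expected).issubset(set(labels))  # every expected label must apply

# classify_as(actual_response,
#             options=["empathetic", "professional", "casual", "robotic"],
#             expected=["empathetic", "professional"],
#             classify=my_llm_judge)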

Test Case Best Practices

Focus on User Value
✓ Good: "User asks about Enterprise features - AI lists key differentiators"
✗ Avoid: "Test if GPT-4 embedding similarity > 0.85"
Write from the user’s perspective, not the technical implementation.
One Behavior Per Test
✓ Good: Separate tests for "pricing accuracy" and "tone appropriateness"
✗ Avoid: One test checking pricing, tone, format, and tool usage
Specific tests make failures easier to diagnose.
Use Realistic Conversations
Pull test cases from actual user interactions in Analyze → Conversations. Real questions beat invented scenarios.
Cover Edge Cases
- Misspellings: "enterprize plan"
- Vague requests: "I need help"
- Out of scope: "What's the weather?"
- Multi-part: "Tell me about pricing and also can I get a demo?"
Test Tool Usage
Include cases where AI should (and shouldn’t) use specific tools like search or escalation.

Running Tests

Once you have test cases, run the suite to evaluate current AI behavior.

Starting a Test Run

  1. Navigate to your test suite
  2. Click “Run Test Suite” in the test runs section
  3. Enter a name for this run
    • Include version or purpose: “v0.5 pre-deploy”, “Post pricing update”
  4. Add description (optional)
    • Note what changed since last run
    • Document what you’re validating
  5. Click “Create Test Run”
The platform executes each test case against your current AI configuration:
  • Sends the user message(s) to the AI
  • Collects the AI’s response
  • Evaluates against the check criteria
  • Records pass/fail for each test
Test runs execute against your currently deployed AI behavior. Make sure you’ve built and deployed the version you want to test.

Test Execution

During execution:
  • Status updates show progress through test cases
  • Real-time results appear as tests complete
  • Polling continues until all tests finish
  • Duration varies based on suite size (typically 30-120 seconds)

Interpreting Results

Test run results show:
Overall Statistics
  • Total test cases
  • Passed count
  • Failed count
  • Pass rate percentage
Individual Test Results
For each test case:
  • Passed - AI response met criteria
  • Failed - AI response didn’t meet criteria
  • Actual Response - What the AI said
  • Evaluation Details - Why it passed or failed
Reviewing Failures
Click on a failed test to see:
  1. Input conversation - What was sent to the AI
  2. Expected outcome - What the check looked for
  3. Actual response - What the AI produced
  4. Evaluation reasoning - Why it didn’t match

Common Failure Patterns

Threshold Too High (Semantic Equivalence)
Expected: "Enterprise plan costs $500/month"
Actual: "Enterprise pricing is $500 monthly"
Threshold: 95%
Result: Failed at 88% similarity

Fix: Lower threshold to 85% or make expected value less specific
Pattern Too Strict (Regex Matches)
Pattern: \$500/month
Actual: "The cost is $500 per month"
Result: Failed (per month ≠ /month)

Fix: Update pattern to \$500.?(per month|/month|monthly)
Knowledge Gap
Expected: "Pro plan includes 50 seats"
Actual: "Pro plan is great for growing teams"
Result: Failed - missing specific information

Fix: Add seat information to knowledge base or guidance
Wrong Guidance Applied
Expected: Professional, concise answer
Actual: Casual, verbose response
Result: Failed tone check

Fix: Check audience filters - may be using wrong guidance rule
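When a failure looks like the “Pattern Too Strict” case, it can be worth verifying the relaxed pattern against the phrasings you expect before updating the test. For the fix above:

import re

pattern = r"\$500.?(per month|/month|monthly)"
for text in ["The cost is $500 per month", "Pricing is $500/month", "It's $500 monthly"]:
    assert re.search(pattern, text)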

Test-Driven AI Development

Adopt a workflow that uses tests to guide improvements:

The TDD Cycle for AI

1. Write Test for Desired Behavior
Example: "User asks about mobile app availability"
Expected: "Our mobile app is available on iOS and Android.
Download from the App Store or Google Play."
2. Run Test (Expect Failure)
Result: Failed - AI says "I don't have information about mobile apps"
Reason: Knowledge gap
3. Make Minimum Change
Action: Add snippet about mobile app availability
Content: Platform supports mobile apps on iOS and Android,
available in respective app stores
4. Rebuild and Deploy
Build new version with updated knowledge
Deploy to testing environment
5. Run Test Again
Result: Passed - AI now provides correct information
6. Add Regression Test
Keep the test in suite to prevent future breaks

Progressive Test Coverage

Build your test suite incrementally:
Week 1: Core Functionality (10-15 tests)
  • Most common user questions
  • Critical product information
  • Key workflows (signup, pricing, support)
Week 2: Edge Cases (5-10 tests)
  • Unusual but important scenarios
  • Multi-step interactions
  • Context-dependent responses
Week 3: Tone and Quality (5-10 tests)
  • Brand voice compliance
  • Empathy in support scenarios
  • Professional language
Week 4: Tool Usage (5-10 tests)
  • When to search vs. use knowledge
  • Escalation triggers
  • Web fetch appropriateness
Ongoing: Regression Prevention
  • Add test for every bug fix
  • Cover new features as you launch
  • Update tests when requirements change

Integrating with CI/CD

Use test suites as quality gates in your deployment pipeline.

Pre-Deployment Checklist

Before deploying a new AI version:
  1. Run all test suites against the new version
  2. Review failures - Are they expected changes or bugs?
  3. Update tests if requirements legitimately changed
  4. Fix issues if tests caught real problems
  5. Re-run until acceptable pass rate achieved
  6. Deploy with confidence
Set pass rate targets by suite type:
Critical User Paths: 100% pass rate
  • Signup flow, payment questions, account access
  • Zero tolerance for failures in critical areas
General Support: 90%+ pass rate
  • Minor variations acceptable
  • Review failures to identify patterns
Experimental Features: 70%+ pass rate
  • Early-stage capabilities
  • Tests help guide refinement
Never deploy if critical path tests fail. A broken signup flow or payment process damages trust and revenue.
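If it helps your release process, these targets can be written down as a simple go/no-go check. A minimal sketch, assuming you record a pass rate per suite category after each manual run (category names are illustrative):

TARGETS = {"critical_paths": 1.00, "general_support": 0.90, "experimental": 0.70}

def ready_to_deploy(pass_rates: dict[str, float]) -> bool:
    return all(pass_rates.get(category, 0.0) >= target
               for category, target in TARGETS.items())

ready_to_deploy({"critical_paths": 1.00, "general_support": 0.93, "experimental": 0.75})  # True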

Automation Opportunities

While botBrains doesn’t currently offer API-triggered test runs, you can:
  1. Schedule manual runs before each deployment window
  2. Document test results in deployment notes
  3. Track pass rates over time in a spreadsheet
  4. Alert team when rates drop below threshold
Future API support will enable fully automated CI/CD integration.
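For point 3, a shared CSV works fine as the tracking spreadsheet until API support arrives. A small sketch (the file name and columns are illustrative):

import csv, datetime

def log_run(suite: str, passed: int, total: int, path: str = "test_runs.csv") -> None:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(), suite, passed, total,
            round(passed / total * 100, 1),  # pass rate %
        ])

log_run("Product Knowledge - Pricing", passed=14, total=15)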

Best Practices

Coverage Strategies

Start with Happy Paths
Test the ideal user journey first:
1. User asks product question → AI provides accurate answer
2. User requests pricing → AI explains tiers clearly
3. User needs help → AI searches knowledge, then offers escalation
Add Sad Paths
Test error handling and edge cases:
1. User asks off-topic question → AI politely declines
2. User provides vague request → AI asks clarifying questions
3. AI doesn't find answer → AI offers escalation appropriately
Test Boundaries
Verify your AI stays in scope:
1. Request outside expertise → AI sets appropriate expectations
2. Confidential information request → AI refuses safely
3. Multiple questions → AI addresses all parts

Regression Test Creation

When to Add Regression Tests
Add a test whenever:
  • User reports incorrect information
  • Escalation should have happened but didn’t
  • Response tone was inappropriate
  • Tool was used incorrectly
  • Edge case wasn’t handled
Regression Test Template
Name: [Bug ID] - [Brief description]
Description: Regression test for [issue]. Previously AI [wrong behavior],
should now [correct behavior].
Input: [Exact user message that triggered bug]
Check: [Verification that bug is fixed]
Example:
Name: BUG-123 - Student discount eligibility
Description: Regression test for student discount confusion. Previously AI
said students don't qualify, should clarify non-profit org discounts are
separate from education pricing.
Input: "Do you offer student discounts for non-profit organizations?"
Check: Semantically Equivalent
Expected: "We have separate programs: student discounts for individual
students in education, and non-profit organization discounts for registered
charities. Which applies to your situation?"
Threshold: 75%

Maintaining Test Suites

Regular Maintenance Tasks
Monthly: Review and Update
  • Remove obsolete tests (deprecated features)
  • Update expected values for intentional changes
  • Add tests for new features
  • Adjust thresholds based on performance
After Major Changes
  • Guidance overhaul → Update all tone/style tests
  • Knowledge migration → Update semantic equivalence tests
  • Tool changes → Update tool usage tests
  • Audience changes → Add segment-specific tests
When Tests Become Noise
Remove tests that:
  • Fail inconsistently without clear pattern
  • Test implementation details instead of user value
  • Duplicate coverage of other tests
  • Apply to removed features
A smaller suite with high-signal tests beats a large suite with noise. Quality over quantity.

Test Data Management

Sensitive Information
Never include in test cases:
  • Real customer names, emails, or PII
  • Actual account numbers or IDs
  • Confidential business information
  • Internal system details
Use placeholder data:
✗ Avoid: "What's the status of order #123456 for john.smith@company.com?"
✓ Better: "What's the status of my order?"
Realistic but Generic
✓ "I'm on the Pro plan and need to upgrade to Enterprise for my team of 50"
✓ "Our company is evaluating your product for customer support automation"
✓ "I'm getting an error when trying to connect my Zendesk account"
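If you pull test inputs from real conversations, a quick scrub pass for obvious identifiers is a sensible habit. The patterns below are illustrative only and are not a complete PII filter:

import re

def scrub(text: str) -> str:
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[email]", text)  # email addresses
    text = re.sub(r"#\d{4,}", "[order id]", text)               # order/ticket numbers
    return text

scrub("What's the status of order #123456 for john.smith@company.com?")
# -> "What's the status of order [order id] for [email]?"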

Common Testing Patterns

Pattern 1: Knowledge Accuracy Suite

Goal: Verify factual information is correct
Structure:
Test Suite: Product Knowledge - Pricing
├─ Monthly pricing for Pro plan
├─ Annual pricing with discount
├─ Enterprise custom pricing message
├─ Free tier limitations
└─ Trial period duration
Check Type: Semantically Equivalent (80% threshold)
When to Run: After any knowledge updates, before deployment

Pattern 2: Tone Consistency Suite

Goal: Ensure a consistent brand voice across scenarios
Structure:
Test Suite: Brand Voice - Professional & Empathetic
├─ Response to frustrated user
├─ Response to confused user
├─ Response to enthusiastic user
├─ Response to detailed technical question
└─ Response to simple question
Check Type: Classify As (tone classifications)
When to Run: After guidance changes, monthly quality check

Pattern 3: Tool Usage Suite

Goal: Validate AI uses tools appropriately
Structure:
Test Suite: Tool Usage - Search & Escalation
├─ Should search knowledge base for documented feature
├─ Should NOT search for basic greeting
├─ Should escalate complex technical issue
├─ Should NOT escalate simple password reset
└─ Should offer handoff after failed resolution
Check Type: Mix of Matches Pattern and Classify As
When to Run: After tool configuration changes

Pattern 4: Multi-Turn Context Suite

Goal: Test context retention across conversation
Structure:
Test Suite: Context Handling
├─ Follow-up question references previous answer
├─ Pronoun resolution ("it", "that", "them")
├─ Topic switch mid-conversation
├─ Returning to previous topic
└─ Contradictory information handling
Check Type: Semantically Equivalent
When to Run: After model updates, monthly

Pattern 5: Segment-Specific Suite

Goal: Validate audience-targeted behavior
Structure:
Test Suite: Enterprise Customer Experience
├─ Enterprise user gets priority escalation offer
├─ Enterprise user sees advanced features
├─ Enterprise user gets dedicated support mention
├─ Free user doesn't get enterprise messaging
└─ Free user sees appropriate upgrade path
Check Type: Mix of Semantically Equivalent and Classify As
When to Run: After audience/guidance changes

Troubleshooting Test Failures

Inconsistent Results

Problem: Same test passes sometimes, fails others
Common Causes:
  • Threshold too close to borderline (e.g., 75% when responses vary 70-80%)
  • AI response has acceptable variation in wording
  • Check criteria too strict for creative responses
Solutions:
1. Lower semantic threshold by 5-10%
2. Use pattern matching for must-have phrases, semantic for concepts
3. Accept that some variation is normal - update expected value to be more general
4. If truly random, this may indicate guidance needs to be more specific

Unexpected Passes

Problem: Test passes but manual review shows poor quality
Common Causes:
  • Expected value too vague
  • Threshold too low
  • Check type doesn’t match what you’re validating
Solutions:
1. Make expected value more specific and detailed
2. Increase semantic threshold
3. Switch check type (e.g., semantic → pattern match for exact phrases)
4. Add multiple checks - one for content, one for tone

All Tests Failing

Problem: Entire suite suddenly fails
Common Causes:
  • Wrong AI version deployed
  • Guidance deactivated
  • Required tools disabled
  • Knowledge source removed
Solutions:
1. Check deployed version - is it what you intended to test?
2. Verify guidance rules are active
3. Confirm tools are enabled
4. Review knowledge sources and data providers
5. Check for recent configuration changes

False Negatives

Problem: Test fails but response is actually good
Common Causes:
  • Expected value doesn’t match how AI naturally phrases things
  • Threshold too high
  • Pattern too specific
Solutions:
1. Update expected value to match AI's phrasing style
2. Reduce threshold to 70-80% for concept matching
3. Broaden regex pattern to accept variations
4. Consider if test expectations are realistic

Measuring Test Suite Health

Track these metrics to ensure your testing remains effective:

Coverage Metrics

Critical Path Coverage
(Test cases covering critical scenarios / Total critical scenarios) × 100
Target: 100%
Feature Coverage
(Features with test cases / Total features) × 100
Target: 80%+
Tool Coverage
(Tools with usage tests / Total tools enabled) × 100
Target: 100%

Quality Metrics

Pass Rate Trend
Track over time - should remain stable or improve
Declining pass rate indicates quality issues or outdated tests
False Positive Rate
(Tests passing that should fail / Total tests) × 100
Target: <5%
False Negative Rate
(Tests failing that should pass / Total tests) × 100
Target: <10%
Test Maintenance Frequency
Tests updated per deployment
Target: Review 10-20% of tests per major deployment
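These rates are simple enough to compute by hand, but a tiny helper keeps the arithmetic consistent across runs; the counts below are placeholders for your own review numbers.

def rate(part: int, whole: int) -> float:
    return round(part / whole * 100, 1) if whole else 0.0

total_tests = 40
rate(12, 12)          # critical path coverage -> 100.0 (target: 100%)
rate(1, total_tests)  # false positive rate    -> 2.5   (target: < 5%)
rate(3, total_tests)  # false negative rate    -> 7.5   (target: < 10%)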

Next Steps

Now that you understand AI testing, remember that testing is continuous. As your AI evolves and user needs change, your test suite should evolve too. The investment in testing pays dividends in quality, reliability, and peace of mind.