Testing AI Agents
Testing AI agents is fundamentally different from testing traditional software. Your AI doesn’t follow fixed logic paths - it generates responses dynamically based on instructions, knowledge, and context. Test suites help you verify that changes to guidance, knowledge, or tools improve your AI without breaking existing capabilities.
Why Testing AI Agents Matters
Unlike conventional software where bugs are binary (works or doesn’t), AI behavior exists on a quality spectrum. Testing helps you:
- Maintain quality standards - Ensure responses meet your accuracy and tone requirements
- Prevent regressions - Catch when updates break working functionality
- Build with confidence - Deploy changes knowing they won’t degrade user experience
- Document expected behavior - Test cases serve as living specifications
- Track improvements - Measure progress as you refine guidance and knowledge
- Validate before production - Catch issues in development, not in customer conversations
Think of test suites as quality guardrails. They won’t prevent every issue, but they catch the most common problems before users encounter them.
Understanding Test Suites
A test suite is a collection of test cases that validate specific aspects of your AI’s behavior. You organize test cases into suites based on:
- Feature area - Product questions, billing inquiries, technical support
- User segment - Enterprise customers, free users, trial accounts
- Integration channel - Website chat, Zendesk, Slack, Salesforce
- Release cycle - Pre-deployment checks, regression tests, smoke tests
Anatomy of a Test Suite
Each test suite contains:
Name and Description
- Clear label indicating what you’re testing
- Description explaining the suite’s purpose
- Examples: “Sales Inquiries”, “Enterprise Customer Support”, “Product Knowledge”
Test Cases
- Individual scenarios you want to validate
- Each case tests one specific behavior or capability
- Cases can have conversation history for context
Test Runs
- Executions of all test cases in the suite
- Creates a snapshot of pass/fail results
- Tracks performance over time
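If it helps to picture how these pieces fit together, the sketch below models a suite, its cases, and a run as plain Python data structures. The class and field names are illustrative only, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestCase:
    """One scenario to validate: an input conversation plus a check."""
    name: str
    conversation: list[dict]           # e.g. [{"role": "user", "content": "..."}]
    check_type: str                    # "semantic", "matches", "not_matches", "classify"
    expected: str
    threshold: Optional[float] = None  # only used by semantic checks
    description: str = ""

@dataclass
class TestSuite:
    """A named collection of test cases covering one feature area or segment."""
    name: str
    description: str
    cases: list[TestCase] = field(default_factory=list)

@dataclass
class TestRun:
    """A snapshot of executing every case in a suite against one AI version."""
    suite_name: str
    run_name: str
    results: dict[str, bool] = field(default_factory=dict)  # case name -> passed?

    @property
    def pass_rate(self) -> float:
        return 100 * sum(self.results.values()) / max(len(self.results), 1)
```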
Creating Test Suites
Navigate to Test Suites in the platform sidebar to manage your tests.
Creating Your First Suite
- Click “Create Your First Test Suite” (or “New Test Suite” if you have existing suites)
- Enter a descriptive name
- Good: “Product Knowledge - Pricing Questions”
- Avoid: “Test Suite 1”
- Write a clear description
- Explain what behavior this suite validates
- Include context about which guidance or knowledge it covers
- Save the suite
Start with a small suite (5-10 test cases) focused on your most critical user interactions. Expand as you identify more patterns to test.
Organizing Multiple Suites
As your testing grows, organize suites by:
Critical User Paths
- Product Questions (General Users)
- Billing and Subscription Support
- Technical Troubleshooting
- Account Management
- Onboarding Flow
Customer Segments
- Enterprise Customer Interactions
- Free Tier Support
- Trial User Engagement
- Premium Support Cases
Integrations
- Website Chat Behavior
- Zendesk Agent Assist
- Slack Internal Support
- Salesforce Case Deflection
Writing Effective Test Cases
Test cases define the scenarios your AI should handle correctly. Each case specifies input (user messages) and expected output (evaluation criteria).
Creating a Test Case
Navigate to a test suite and click “Add Test Case”:
Name
- Describe the scenario being tested
- Be specific about what you’re validating
- Example: “Explain monthly pricing with discount code”
Description (optional)
- Add context about why this test matters
- Document edge cases or special conditions
- Link to related conversations or issues
Conversation History
- Build the context leading to the test
- Add alternating user and assistant messages (user → assistant → user)
- Final user message is what gets evaluated
Check Configuration
- Define how to evaluate the response
- Choose the appropriate check type
- Set thresholds and expected values
Crafting User Queries
The conversation input should represent realistic user interactions:
Single Turn Tests
User: "What's the difference between your Pro and Enterprise plans?"
Test single-question scenarios - the most common pattern.
Multi-Turn Context
User: "I'm interested in your product"
Assistant: "Great! What's your main use case?"
User: "I need to support 1000 customers per month"
Assistant: "For that volume, I'd recommend our Business or Enterprise plan."
User: "What's the pricing difference?"
Test context-dependent responses where history matters.
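If you script or export test cases, the multi-turn example above can be thought of as an ordered list of role-tagged messages, where only the final user turn is evaluated. The structure below is a hypothetical representation, not a botBrains export format.

```python
# Hypothetical representation of the multi-turn test above: the history gives
# the AI context, and only the final user message is evaluated by the check.
conversation = [
    {"role": "user", "content": "I'm interested in your product"},
    {"role": "assistant", "content": "Great! What's your main use case?"},
    {"role": "user", "content": "I need to support 1000 customers per month"},
    {"role": "assistant", "content": "For that volume, I'd recommend our Business or Enterprise plan."},
    {"role": "user", "content": "What's the pricing difference?"},  # the evaluated turn
]
```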
Edge Cases
User: "Do you offer student discounts for non-profit organizations in Europe?"
Test complex scenarios combining multiple conditions.
Ambiguous Queries
Test how AI handles vague questions (should ask clarifying questions).
Avoid testing random or nonsensical inputs. Focus on realistic user questions that represent actual patterns from your conversations.
Choosing Check Types
botBrains offers four evaluation methods:
Semantically Equivalent
Use when: You want the AI to convey specific information, but exact wording doesn’t matter.
Configuration:
- Expected Value: The core message the response should contain
- Threshold: Similarity percentage (60-100%)
Example:
Expected: "Enterprise plan includes priority support, dedicated account manager,
and custom SLA. Pricing starts at $500/month."
Threshold: 75%
✓ Pass: "Our Enterprise tier gives you priority support with a dedicated account
manager and customized SLA. Plans begin at $500 monthly."
✗ Fail: "Enterprise is our premium option with lots of benefits. Contact sales
for pricing."
Best for:
- Knowledge accuracy tests
- Factual information validation
- Concept explanation checks
Start with 70% threshold and adjust based on results. Higher thresholds (85%+) require very close matches. Lower thresholds (60-70%) allow more variation in wording.
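Conceptually, a semantic equivalence check compares an embedding of the response against an embedding of the expected value and passes when their similarity clears the threshold. The sketch below assumes a generic embed() function and cosine similarity; botBrains' actual scoring method may differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_check(response: str, expected: str, threshold: float, embed) -> bool:
    """Pass if the response is at least `threshold` percent similar to the expected value.

    `embed` is a stand-in for whatever embedding model the evaluator uses;
    the real scoring pipeline is not documented here.
    """
    similarity = cosine_similarity(embed(response), embed(expected))
    return similarity * 100 >= threshold
```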
Matches Pattern
Use when: The response must contain specific keywords, phrases, or formatting.
Configuration:
- Pattern: Regular expression to match against response
Examples:
Pattern: \$\d+/month
Use: Verify pricing is mentioned with dollar amount and /month
Pattern: (?i)(enterprise|business|pro) plan
Use: Ensure specific plan tiers are referenced
Pattern: ^(I understand|I see|Got it)
Use: Validate response starts with acknowledgment
Best for:
- Format validation (dates, prices, codes)
- Required phrase inclusion
- Structured response checks
- Compliance language verification
Not Matches Pattern
Use when: The response should NOT contain certain content.
Configuration:
- Pattern: Regular expression that should NOT appear in response
Examples:
Pattern: (?i)(I don't know|I'm not sure|I can't help)
Use: Ensure AI doesn't express uncertainty inappropriately
Pattern: \[TODO\]|PLACEHOLDER
Use: Verify no template text appears in responses
Pattern: (?i)confidential|internal only
Use: Prevent leakage of internal information
Best for:
- Prohibited content detection
- Tone/voice violations
- Information security checks
- Avoiding specific phrases
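Both pattern checks reduce to a regular-expression search over the response: Matches Pattern passes when the pattern is found, Not Matches Pattern passes when it is not. A minimal Python equivalent, using Python's re syntax (which may differ slightly from the platform's regex flavor):

```python
import re

def matches_pattern(response: str, pattern: str) -> bool:
    """Pass if the pattern appears anywhere in the response."""
    return re.search(pattern, response) is not None

def not_matches_pattern(response: str, pattern: str) -> bool:
    """Pass only if the pattern does NOT appear in the response."""
    return re.search(pattern, response) is None

# Using the example patterns from above:
assert matches_pattern("The Pro plan is $49/month.", r"\$\d+/month")
assert not_matches_pattern("Happy to help with that!",
                           r"(?i)(I don't know|I'm not sure|I can't help)")
```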
Classify As
Use when: The response should be categorized into predefined classes.
Configuration:
- Options: List of possible classifications (comma-separated)
- Expected: Which classifications should apply (can be multiple)
Examples:
Options: helpful, neutral, unhelpful
Expected: helpful
Use: Validate response helpfulness
Options: sales, support, product info, account management
Expected: product info
Use: Check topic categorization
Options: empathetic, professional, casual, robotic
Expected: empathetic, professional
Use: Assess tone appropriateness
Best for:
- Tone and sentiment validation
- Topic classification
- Multi-dimensional quality checks
- Intent verification
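The pass/fail logic for a Classify As check is simple set containment: every expected label must be among the labels the evaluator assigns (typically an LLM judge prompted with the option list). A minimal sketch of just that logic, with hypothetical label sets:

```python
def classify_check(predicted: set[str], expected: set[str]) -> bool:
    """Pass if every expected label was assigned to the response.

    How labels get predicted is up to the evaluator; this sketch only
    shows the pass/fail comparison.
    """
    return expected.issubset(predicted)

# Options: empathetic, professional, casual, robotic
# Expected: empathetic, professional
assert classify_check(predicted={"empathetic", "professional"},
                      expected={"empathetic", "professional"})
assert not classify_check(predicted={"casual"},
                          expected={"empathetic", "professional"})
```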
Test Case Best Practices
Focus on User Value
✓ Good: "User asks about Enterprise features - AI lists key differentiators"
✗ Avoid: "Test if GPT-4 embedding similarity > 0.85"
Write from user perspective, not technical implementation.
One Behavior Per Test
✓ Good: Separate tests for "pricing accuracy" and "tone appropriateness"
✗ Avoid: One test checking pricing, tone, format, and tool usage
Specific tests make failures easier to diagnose.
Use Realistic Conversations
Pull test cases from actual user interactions in Analyze → Conversations. Real questions beat invented scenarios.
Cover Edge Cases
- Misspellings: "enterprize plan"
- Vague requests: "I need help"
- Out of scope: "What's the weather?"
- Multi-part: "Tell me about pricing and also can I get a demo?"
Test Tool Usage
Include cases where AI should (and shouldn’t) use specific tools like search or escalation.
Running Tests
Once you have test cases, run the suite to evaluate current AI behavior.
Starting a Test Run
- Navigate to your test suite
- Click “Run Test Suite” in the test runs section
- Enter a name for this run
- Include version or purpose: “v0.5 pre-deploy”, “Post pricing update”
- Add description (optional)
- Note what changed since last run
- Document what you’re validating
- Click “Create Test Run”
The platform executes each test case against your current AI configuration:
- Sends the user message(s) to the AI
- Collects the AI’s response
- Evaluates against the check criteria
- Records pass/fail for each test
Test runs execute against your currently deployed AI behavior. Make sure you’ve built and deployed the version you want to test.
Test Execution
During execution:
- Status updates show progress through test cases
- Real-time results appear as tests complete
- Polling continues until all tests finish
- Duration varies based on suite size (typically 30-120 seconds)
Interpreting Results
Test run results show:
Overall Statistics
- Total test cases
- Passed count
- Failed count
- Pass rate percentage
Individual Test Results
For each test case:
- ✓ Passed - AI response met criteria
- ✗ Failed - AI response didn’t meet criteria
- Actual Response - What the AI said
- Evaluation Details - Why it passed or failed
Reviewing Failures
Click on a failed test to see:
- Input conversation - What was sent to the AI
- Expected outcome - What the check looked for
- Actual response - What the AI produced
- Evaluation reasoning - Why it didn’t match
Common Failure Patterns
Threshold Too High (Semantic Equivalence)
Expected: "Enterprise plan costs $500/month"
Actual: "Enterprise pricing is $500 monthly"
Threshold: 95%
Result: Failed at 88% similarity
Fix: Lower threshold to 85% or make expected value less specific
Pattern Too Strict (Regex Matches)
Pattern: \$500/month
Actual: "The cost is $500 per month"
Result: Failed (per month ≠ /month)
Fix: Update pattern to \$500.?(per month|/month|monthly)
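Before committing a broadened pattern, it's worth sanity-checking it against the phrasings you've actually seen. A quick check using Python's re module (whose syntax may differ slightly from the platform's regex flavor):

```python
import re

# Verify the broadened pattern accepts the real-world variants that failed before.
pattern = r"\$500.?(per month|/month|monthly)"
for text in ["$500/month", "The cost is $500 per month", "Plans start at $500 monthly"]:
    assert re.search(pattern, text), f"pattern missed: {text}"
```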
Knowledge Gap
Expected: "Pro plan includes 50 seats"
Actual: "Pro plan is great for growing teams"
Result: Failed - missing specific information
Fix: Add seat information to knowledge base or guidance
Wrong Guidance Applied
Expected: Professional, concise answer
Actual: Casual, verbose response
Result: Failed tone check
Fix: Check audience filters - may be using wrong guidance rule
Test-Driven AI Development
Adopt a workflow that uses tests to guide improvements:
The TDD Cycle for AI
1. Write Test for Desired Behavior
Example: "User asks about mobile app availability"
Expected: "Our mobile app is available on iOS and Android.
Download from the App Store or Google Play."
2. Run Test (Expect Failure)
Result: Failed - AI says "I don't have information about mobile apps"
Reason: Knowledge gap
3. Make Minimum Change
Action: Add snippet about mobile app availability
Content: Platform supports mobile apps on iOS and Android,
available in respective app stores
4. Rebuild and Deploy
Build new version with updated knowledge
Deploy to testing environment
5. Run Test Again
Result: Passed - AI now provides correct information
6. Add Regression Test
Keep the test in suite to prevent future breaks
Progressive Test Coverage
Build your test suite incrementally:
Week 1: Core Functionality (10-15 tests)
- Most common user questions
- Critical product information
- Key workflows (signup, pricing, support)
Week 2: Edge Cases (5-10 tests)
- Unusual but important scenarios
- Multi-step interactions
- Context-dependent responses
Week 3: Tone and Quality (5-10 tests)
- Brand voice compliance
- Empathy in support scenarios
- Professional language
Week 4: Tool Usage (5-10 tests)
- When to search vs. use knowledge
- Escalation triggers
- Web fetch appropriateness
Ongoing: Regression Prevention
- Add test for every bug fix
- Cover new features as you launch
- Update tests when requirements change
Integrating with CI/CD
Use test suites as quality gates in your deployment pipeline.
Pre-Deployment Checklist
Before deploying a new AI version:
- Run all test suites against the new version
- Review failures - Are they expected changes or bugs?
- Update tests if requirements legitimately changed
- Fix issues if tests caught real problems
- Re-run until acceptable pass rate achieved
- Deploy with confidence
Recommended Pass Thresholds
Critical User Paths: 100% pass rate
- Signup flow, payment questions, account access
- Zero tolerance for failures in critical areas
General Support: 90%+ pass rate
- Minor variations acceptable
- Review failures to identify patterns
Experimental Features: 70%+ pass rate
- Early-stage capabilities
- Tests help guide refinement
Never deploy if critical path tests fail. A broken signup flow or payment process damages trust and revenue.
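As an illustration of these thresholds acting as a deployment gate, the sketch below checks per-category pass rates before allowing a deploy. The category names, thresholds, and data structures are hypothetical, not a botBrains API.

```python
# Minimum pass rates per suite category, mirroring the recommendations above.
PASS_THRESHOLDS = {
    "critical": 100.0,
    "general": 90.0,
    "experimental": 70.0,
}

def can_deploy(pass_rates: dict[str, float]) -> bool:
    """Return True only if every suite category meets its minimum pass rate."""
    return all(
        pass_rates.get(category, 0.0) >= minimum
        for category, minimum in PASS_THRESHOLDS.items()
    )

print(can_deploy({"critical": 100.0, "general": 92.5, "experimental": 71.0}))  # True
print(can_deploy({"critical": 96.0, "general": 95.0, "experimental": 90.0}))   # False
```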
Automation Opportunities
While botBrains doesn’t currently offer API-triggered test runs, you can:
- Schedule manual runs before each deployment window
- Document test results in deployment notes
- Track pass rates over time in a spreadsheet
- Alert team when rates drop below threshold
Future API support will enable fully automated CI/CD integration.
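Until then, a lightweight local script can stand in for scheduled tracking. A minimal sketch that logs manual run results to a CSV and flags drops below a chosen alert level (the file name and threshold are placeholders):

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("test_run_history.csv")  # placeholder log location
ALERT_THRESHOLD = 90.0              # placeholder alert level; set per suite

def log_run(suite: str, version: str, passed: int, total: int) -> None:
    """Append a manual test run to the CSV log and flag drops below the threshold."""
    pass_rate = 100 * passed / total
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "suite", "version", "passed", "total", "pass_rate"])
        writer.writerow([date.today().isoformat(), suite, version,
                         passed, total, f"{pass_rate:.1f}"])
    if pass_rate < ALERT_THRESHOLD:
        print(f"ALERT: {suite} pass rate dropped to {pass_rate:.1f}% on {version}")

log_run("Product Knowledge - Pricing", "v0.5 pre-deploy", passed=18, total=20)
```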
Best Practices
Coverage Strategies
Start with Happy Paths
Test the ideal user journey first:
1. User asks product question → AI provides accurate answer
2. User requests pricing → AI explains tiers clearly
3. User needs help → AI searches knowledge, then offers escalation
Add Sad Paths
Test error handling and edge cases:
1. User asks off-topic question → AI politely declines
2. User provides vague request → AI asks clarifying questions
3. AI doesn't find answer → AI offers escalation appropriately
Test Boundaries
Verify your AI stays in scope:
1. Request outside expertise → AI sets appropriate expectations
2. Confidential information request → AI refuses safely
3. Multiple questions → AI addresses all parts
Regression Test Creation
When to Add Regression Tests
Add a test whenever:
- User reports incorrect information
- Escalation should have happened but didn’t
- Response tone was inappropriate
- Tool was used incorrectly
- Edge case wasn’t handled
Regression Test Template
Name: [Bug ID] - [Brief description]
Description: Regression test for [issue]. Previously AI [wrong behavior],
should now [correct behavior].
Input: [Exact user message that triggered bug]
Check: [Verification that bug is fixed]
Example:
Name: BUG-123 - Student discount eligibility
Description: Regression test for student discount confusion. Previously AI
said students don't qualify, should clarify non-profit org discounts are
separate from education pricing.
Input: "Do you offer student discounts for non-profit organizations?"
Check: Semantically Equivalent
Expected: "We have separate programs: student discounts for individual
students in education, and non-profit organization discounts for registered
charities. Which applies to your situation?"
Threshold: 75%
Maintaining Test Suites
Regular Maintenance Tasks
Monthly: Review and Update
- Remove obsolete tests (deprecated features)
- Update expected values for intentional changes
- Add tests for new features
- Adjust thresholds based on performance
After Major Changes
- Guidance overhaul → Update all tone/style tests
- Knowledge migration → Update semantic equivalence tests
- Tool changes → Update tool usage tests
- Audience changes → Add segment-specific tests
When Tests Become Noise
Remove tests that:
- Fail inconsistently without clear pattern
- Test implementation details instead of user value
- Duplicate coverage of other tests
- Apply to removed features
A smaller suite with high-signal tests beats a large suite with noise. Quality over quantity.
Test Data Management
Sensitive Information
Never include in test cases:
- Real customer names, emails, or PII
- Actual account numbers or IDs
- Confidential business information
- Internal system details
Use placeholder data:
✗ Avoid: "What's the status of order #123456 for john.smith@company.com?"
✓ Better: "What's the status of my order?"
Realistic but Generic
✓ "I'm on the Pro plan and need to upgrade to Enterprise for my team of 50"
✓ "Our company is evaluating your product for customer support automation"
✓ "I'm getting an error when trying to connect my Zendesk account"
Common Testing Patterns
Pattern 1: Knowledge Accuracy Suite
Goal: Verify factual information is correct
Structure:
Test Suite: Product Knowledge - Pricing
├─ Monthly pricing for Pro plan
├─ Annual pricing with discount
├─ Enterprise custom pricing message
├─ Free tier limitations
└─ Trial period duration
Check Type: Semantically Equivalent (80% threshold)
When to Run: After any knowledge updates, before deployment
Pattern 2: Tone Consistency Suite
Goal: Ensure brand voice across scenarios
Structure:
Test Suite: Brand Voice - Professional & Empathetic
├─ Response to frustrated user
├─ Response to confused user
├─ Response to enthusiastic user
├─ Response to detailed technical question
└─ Response to simple question
Check Type: Classify As (tone classifications)
When to Run: After guidance changes, monthly quality check
Pattern 3: Tool Usage Suite
Goal: Validate AI uses tools appropriately
Structure:
Test Suite: Tool Usage - Search & Escalation
├─ Should search knowledge base for documented feature
├─ Should NOT search for basic greeting
├─ Should escalate complex technical issue
├─ Should NOT escalate simple password reset
└─ Should offer handoff after failed resolution
Check Type: Mix of Matches and Classify As
When to Run: After tool configuration changes
Pattern 4: Multi-Turn Context Suite
Goal: Test context retention across conversation
Structure:
Test Suite: Context Handling
├─ Follow-up question references previous answer
├─ Pronoun resolution ("it", "that", "them")
├─ Topic switch mid-conversation
├─ Returning to previous topic
└─ Contradictory information handling
Check Type: Semantically Equivalent
When to Run: After model updates, monthly
Pattern 5: Segment-Specific Suite
Goal: Validate audience-targeted behavior
Structure:
Test Suite: Enterprise Customer Experience
├─ Enterprise user gets priority escalation offer
├─ Enterprise user sees advanced features
├─ Enterprise user gets dedicated support mention
├─ Free user doesn't get enterprise messaging
└─ Free user sees appropriate upgrade path
Check Type: Mix of Semantically Equivalent and Classify As
When to Run: After audience/guidance changes
Troubleshooting Test Failures
Inconsistent Results
Problem: Same test passes sometimes, fails others
Common Causes:
- Threshold too close to borderline (e.g., 75% when responses vary 70-80%)
- AI response has acceptable variation in wording
- Check criteria too strict for creative responses
Solutions:
1. Lower semantic threshold by 5-10%
2. Use pattern matching for must-have phrases, semantic for concepts
3. Accept that some variation is normal - update expected value to be more general
4. If truly random, this may indicate guidance needs to be more specific
Unexpected Passes
Problem: Test passes but manual review shows poor quality
Common Causes:
- Expected value too vague
- Threshold too low
- Check type doesn’t match what you’re validating
Solutions:
1. Make expected value more specific and detailed
2. Increase semantic threshold
3. Switch check type (e.g., semantic → pattern match for exact phrases)
4. Add multiple checks - one for content, one for tone
All Tests Failing
Problem: Entire suite suddenly fails
Common Causes:
- Wrong AI version deployed
- Guidance deactivated
- Required tools disabled
- Knowledge source removed
Solutions:
1. Check deployed version - is it what you intended to test?
2. Verify guidance rules are active
3. Confirm tools are enabled
4. Review knowledge sources and data providers
5. Check for recent configuration changes
False Negatives
Problem: Test fails but response is actually good
Common Causes:
- Expected value doesn’t match how AI naturally phrases things
- Threshold too high
- Pattern too specific
Solutions:
1. Update expected value to match AI's phrasing style
2. Reduce threshold to 70-80% for concept matching
3. Broaden regex pattern to accept variations
4. Consider if test expectations are realistic
Measuring Test Suite Health
Track these metrics to ensure your testing remains effective:
Coverage Metrics
Critical Path Coverage
(Test cases covering critical scenarios / Total critical scenarios) × 100
Target: 100%
Feature Coverage
(Features with test cases / Total features) × 100
Target: 80%+
Tool Coverage
(Tools with usage tests / Total tools enabled) × 100
Target: 100%
Quality Metrics
Pass Rate Trend
Track over time - should remain stable or improve
Declining pass rate indicates quality issues or outdated tests
False Positive Rate
(Tests passing that should fail / Total tests) × 100
Target: <5%
False Negative Rate
(Tests failing that should pass / Total tests) × 100
Target: <10%
Test Maintenance Frequency
Tests updated per deployment
Target: Review 10-20% of tests per major deployment
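All of these metrics are simple ratios, so they're easy to compute from a spreadsheet export or a short script. An illustrative sketch with made-up numbers:

```python
def coverage(covered: int, total: int) -> float:
    """Coverage metrics: (items with tests / total items) x 100."""
    return 100 * covered / total if total else 0.0

def false_rate(mislabeled: int, total_tests: int) -> float:
    """False positive/negative rate: mislabeled outcomes over total tests, x 100."""
    return 100 * mislabeled / total_tests if total_tests else 0.0

print(coverage(covered=12, total=12))            # critical path coverage -> 100.0
print(coverage(covered=16, total=20))            # feature coverage -> 80.0
print(false_rate(mislabeled=1, total_tests=40))  # false negative rate -> 2.5
```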
Next Steps
Now that you understand AI testing, remember that testing is continuous. As your AI evolves and user needs change, your test suite should evolve too. The investment in testing pays dividends in quality, reliability, and peace of mind.