Testing AI Agents
Testing AI agents is fundamentally different from testing traditional software. Your AI doesn't follow fixed logic paths - it generates responses dynamically based on instructions, knowledge, and context. Test suites help you verify that changes to guidance, knowledge, or tools improve your AI without breaking existing capabilities.
Why Testing AI Agents Matters
Unlike conventional software where bugs are binary (works or doesn't), AI behavior exists on a quality spectrum. Testing helps you:
- Maintain quality standards - Ensure responses meet your accuracy and tone requirements
- Prevent regressions - Catch when updates break working functionality
- Build with confidence - Deploy changes knowing they won’t degrade user experience
- Document expected behavior - Test cases serve as living specifications
- Track improvements - Measure progress as you refine guidance and knowledge
- Validate before production - Catch issues in development, not in customer conversations
Understanding Test Suites
A test suite is a collection of test cases that validate specific aspects of your AI's behavior. You organize test cases into suites based on:
- Feature area - Product questions, billing inquiries, technical support
- User segment - Enterprise customers, free users, trial accounts
- Integration channel - Website chat, Zendesk, Slack, Salesforce
- Release cycle - Pre-deployment checks, regression tests, smoke tests
Anatomy of a Test Suite
Each test suite contains:
Name and Description
- Clear label indicating what you're testing
- Description explaining the suite’s purpose
- Examples: “Sales Inquiries”, “Enterprise Customer Support”, “Product Knowledge”
Test Cases
- Individual scenarios you want to validate
- Each case tests one specific behavior or capability
- Cases can have conversation history for context
Test Runs
- Executions of all test cases in the suite
- Creates a snapshot of pass/fail results
- Tracks performance over time
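Taken together, a suite is just structured data: a name, a description, a set of cases, and a history of runs. A minimal sketch of that shape for reasoning about your own tests (field names here are illustrative, not botBrains' actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    name: str                       # e.g. "Explain monthly pricing with discount code"
    conversation: list[dict]        # message pairs; the final user message is evaluated
    check_type: str                 # "semantically_equivalent", "matches_pattern", ...
    expected: str                   # expected value or pattern, depending on check type
    threshold: float | None = None  # similarity threshold for semantic checks

@dataclass
class TestSuite:
    name: str                       # e.g. "Product Knowledge - Pricing Questions"
    description: str
    cases: list[TestCase] = field(default_factory=list)

@dataclass
class TestRun:
    suite: TestSuite
    label: str                      # e.g. "v0.5 pre-deploy"
    results: dict[str, bool] = field(default_factory=dict)  # case name -> pass/fail
```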
Creating Test Suites
Navigate to Test Suites in the platform sidebar to manage your tests.
Creating Your First Suite
1. Click "Create Your First Test Suite" (or "New Test Suite" if you have existing suites)
2. Enter a descriptive name
   - Good: "Product Knowledge - Pricing Questions"
   - Avoid: "Test Suite 1"
3. Write a clear description
   - Explain what behavior this suite validates
   - Include context about which guidance or knowledge it covers
4. Save the suite
Start with a small suite (5-10 test cases) focused on your most critical user interactions. Expand as you identify more patterns to test.
Organizing Multiple Suites
As your testing grows, organize suites around your critical user paths and the dimensions listed above (feature area, user segment, channel, release cycle).
Writing Effective Test Cases
Test cases define the scenarios your AI should handle correctly. Each case specifies input (user messages) and expected output (evaluation criteria).
Creating a Test Case
Navigate to a test suite and click "Add Test Case":
Name
- Describe the scenario being tested
- Be specific about what you’re validating
- Example: “Explain monthly pricing with discount code”
Description
- Add context about why this test matters
- Document edge cases or special conditions
- Link to related conversations or issues
Conversation Input
- Build the context leading to the test
- Add message pairs (user → assistant → user)
- Final user message is what gets evaluated
Check Configuration
- Define how to evaluate the response
- Choose the appropriate check type
- Set thresholds and expected values
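Putting those fields together, a single test case might look like the sketch below. The message roles, check fields, prices, and discount code are all illustrative assumptions, not the platform's exact format:

```python
# A hypothetical test case: two turns of context, final user message gets evaluated.
pricing_discount_case = {
    "name": "Explain monthly pricing with discount code",
    "description": "Added after a user was quoted the wrong discounted price.",
    "conversation": [
        {"role": "user", "content": "How much is the Pro plan per month?"},
        {"role": "assistant", "content": "The Pro plan is $49 per month."},
        {"role": "user", "content": "What does it cost with the SAVE20 code?"},
    ],
    "check": {
        "type": "semantically_equivalent",
        "expected": "With the SAVE20 code, the Pro plan costs $39.20 per month.",
        "threshold": 0.8,  # 80% similarity
    },
}
```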
Crafting User Queries
The conversation input should represent realistic user interactions, whether a single-turn question or a multi-turn exchange that builds on earlier context.
Choosing Check Types
botBrains offers four evaluation methods.
Semantically Equivalent
Use when: You want the AI to convey specific information, but exact wording doesn't matter.
Configuration:
- Expected Value: The core message the response should contain
- Threshold: Similarity percentage (60-100%)
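Conceptually, the check compares the meaning of the response against the expected value and passes if the similarity score clears the threshold. A rough sketch of that decision, with `similarity()` as a crude lexical stand-in for whatever scoring the platform actually uses:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude lexical stand-in for a real semantic score in [0, 1]; the platform's
    # own check is meaning-based, so treat this only as an illustration.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def semantically_equivalent(response: str, expected: str, threshold: float = 0.8) -> bool:
    # Pass when the response conveys the expected message closely enough,
    # even if the wording differs.
    return similarity(response, expected) >= threshold
```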
Best for:
- Knowledge accuracy tests
- Factual information validation
- Concept explanation checks
Matches Pattern
Use when: The response must contain specific keywords, phrases, or formatting.
Configuration:
- Pattern: Regular expression to match against response
Examples (sketched below):
- Verify pricing format
- Ensure plan tiers are referenced
- Validate acknowledgment
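The patterns themselves are ordinary regular expressions. Illustrative sketches of what the three examples above might look like - the plan names, price format, and acknowledgment phrases are assumptions you would replace with your own wording:

```python
import re

# Verify pricing format: a dollar amount like $49 or $49.00 followed by a billing period.
pricing_pattern = re.compile(r"\$\d+(\.\d{2})?\s*(per month|/month|monthly)", re.IGNORECASE)

# Ensure plan tiers are referenced: at least one named tier appears in the response.
plan_tier_pattern = re.compile(r"\b(Starter|Pro|Enterprise)\b")

# Validate acknowledgment: the response opens by acknowledging the request.
acknowledgment_pattern = re.compile(r"^(Thanks|Thank you|Got it|Happy to help)", re.IGNORECASE)

assert pricing_pattern.search("The Pro plan costs $49 per month.")
assert plan_tier_pattern.search("The Pro plan costs $49 per month.")
```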
Best for:
- Format validation (dates, prices, codes)
- Required phrase inclusion
- Structured response checks
- Compliance language verification
Not Matches Pattern
Use when: The response should NOT contain certain content.
Configuration:
- Pattern: Regular expression that should NOT appear in the response
Examples (sketched below):
- Prevent inappropriate uncertainty
- Verify no template text
- Prevent information leakage
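As above, these are plain regular expressions; the test fails if the pattern matches anywhere in the response. Illustrative sketches for the three examples - the phrases, placeholder formats, and internal hostname are assumptions to tailor to your own guidance:

```python
import re

# Prevent inappropriate uncertainty: hedging language that shouldn't appear
# when the answer is covered by your knowledge base.
uncertainty_pattern = re.compile(r"(I'm not sure|I don't know|I cannot help)", re.IGNORECASE)

# Verify no template text: unfilled placeholders leaking into the response.
template_pattern = re.compile(r"(\{\{.*?\}\}|\[INSERT[^\]]*\])")

# Prevent information leakage: internal details that must never be shown.
leakage_pattern = re.compile(r"(internal[- ]only|staging\.example\.com|API key)", re.IGNORECASE)

def violates(response: str) -> bool:
    # The response fails the check if any prohibited pattern is found.
    return any(p.search(response) for p in (uncertainty_pattern, template_pattern, leakage_pattern))
```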
Best for:
- Prohibited content detection
- Tone/voice violations
- Information security checks
- Avoiding specific phrases
Classify As
Use when: The response should be categorized into predefined classes.
Configuration:
- Options: List of possible classifications (comma-separated)
- Expected: Which classifications should apply (can be multiple)
Examples (sketched below):
- Validate response helpfulness
- Check topic categorization
- Assess tone appropriateness
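A Classify As check is a list of allowed labels plus the labels you expect to apply. Illustrative configurations for the three examples above - the label names are assumptions, not a fixed vocabulary:

```python
# Validate response helpfulness
helpfulness_check = {
    "type": "classify_as",
    "options": ["helpful", "partially helpful", "unhelpful"],
    "expected": ["helpful"],
}

# Check topic categorization
topic_check = {
    "type": "classify_as",
    "options": ["pricing", "technical support", "account management", "other"],
    "expected": ["pricing"],
}

# Assess tone appropriateness (multiple expected labels are allowed)
tone_check = {
    "type": "classify_as",
    "options": ["professional", "empathetic", "casual", "dismissive"],
    "expected": ["professional", "empathetic"],
}
```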
Best for:
- Tone and sentiment validation
- Topic classification
- Multi-dimensional quality checks
- Intent verification
Test Case Best Practices
Focus on User Value
Test the outcomes users care about, not implementation details.
Running Tests
Once you have test cases, run the suite to evaluate current AI behavior.
Starting a Test Run
1. Navigate to your test suite
2. Click "Run Test Suite" in the test runs section
3. Enter a name for this run
   - Include version or purpose: "v0.5 pre-deploy", "Post pricing update"
4. Add a description (optional)
   - Note what changed since the last run
   - Document what you're validating
5. Click "Create Test Run"
For each test case, the run:
- Sends the user message(s) to the AI
- Collects the AI’s response
- Evaluates against the check criteria
- Records pass/fail for each test
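Under the hood, the loop is simple: for every case, get a response, score it against the check, and record the verdict. A pared-down sketch, where `ask_ai` and `evaluate` are placeholders for the platform's own calls:

```python
def run_suite(suite: dict, ask_ai, evaluate) -> dict[str, bool]:
    """Run every case in a suite and return {case name: passed?}.

    ask_ai(conversation) -> response text  (placeholder for the deployed AI)
    evaluate(response, check) -> bool      (placeholder for the configured check)
    """
    results = {}
    for case in suite["cases"]:
        response = ask_ai(case["conversation"])      # send the user message(s)
        passed = evaluate(response, case["check"])   # score against the check criteria
        results[case["name"]] = passed               # record pass/fail
    return results
```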
Test runs execute against your currently deployed AI behavior. Make sure you’ve built and deployed the version you want to test.
Test Execution
During execution:
- Status updates show progress through test cases
- Real-time results appear as tests complete
- Polling continues until all tests finish
- Duration varies based on suite size (typically 30-120 seconds)
Interpreting Results
Test run results show:
Overall Statistics
- Total test cases
- Passed count
- Failed count
- Pass rate percentage
Individual Test Results
- ✓ Passed - AI response met criteria
- ✗ Failed - AI response didn’t meet criteria
- Actual Response - What the AI said
- Evaluation Details - Why it passed or failed
For each failure, review:
- Input conversation - What was sent to the AI
- Expected outcome - What the check looked for
- Actual response - What the AI produced
- Evaluation reasoning - Why it didn’t match
Common Failure Patterns
- Threshold Too High (Semantic Equivalence)
- Pattern Too Strict (Regex Matches)
- Knowledge Gap
- Wrong Guidance Applied
Test-Driven AI Development
Adopt a workflow that uses tests to guide improvements.
The TDD Cycle for AI
1. Write a test for the desired behavior
2. Run the suite and confirm the new test fails
3. Update guidance, knowledge, or tools to close the gap
4. Re-run until the new test passes without breaking existing tests
Progressive Test Coverage
Build your test suite incrementally:
Week 1: Core Functionality (10-15 tests)
- Most common user questions
- Critical product information
- Key workflows (signup, pricing, support)
Week 2: Edge Cases
- Unusual but important scenarios
- Multi-step interactions
- Context-dependent responses
Week 3: Tone and Voice
- Brand voice compliance
- Empathy in support scenarios
- Professional language
Week 4: Tool Usage
- When to search vs. use knowledge
- Escalation triggers
- Web fetch appropriateness
Ongoing
- Add a test for every bug fix
- Cover new features as you launch
- Update tests when requirements change
Integrating with CI/CD
Use test suites as quality gates in your deployment pipeline.
Pre-Deployment Checklist
Before deploying a new AI version:
- Run all test suites against the new version
- Review failures - Are they expected changes or bugs?
- Update tests if requirements legitimately changed
- Fix issues if tests caught real problems
- Re-run until acceptable pass rate achieved
- Deploy with confidence
Recommended Pass Thresholds
Critical User Paths: 100% pass rate
- Signup flow, payment questions, account access
- Zero tolerance for failures in critical areas
Standard Features: high pass rate
- Minor variations acceptable
- Review failures to identify patterns
New or Experimental Capabilities: lower pass rates expected
- Early-stage capabilities
- Tests help guide refinement
Automation Opportunities
While botBrains doesn't currently offer API-triggered test runs, you can:
- Schedule manual runs before each deployment window
- Document test results in deployment notes
- Track pass rates over time in a spreadsheet (see the sketch below)
- Alert team when rates drop below threshold
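Until API-triggered runs exist, even a small script keeps the history honest. A sketch that appends each run's pass rate to a CSV and flags drops below a chosen threshold - the file name and threshold are illustrative choices, not platform defaults:

```python
import csv
from datetime import date
from pathlib import Path

HISTORY = Path("test_run_history.csv")  # illustrative location
ALERT_THRESHOLD = 0.90                   # pick a rate that matches your own targets

def record_run(run_name: str, passed: int, total: int) -> None:
    pass_rate = passed / total
    new_file = not HISTORY.exists()
    with HISTORY.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "run", "passed", "total", "pass_rate"])
        writer.writerow([date.today().isoformat(), run_name, passed, total, f"{pass_rate:.2%}"])
    if pass_rate < ALERT_THRESHOLD:
        # Swap in your own alerting (Slack webhook, email) here.
        print(f"ALERT: {run_name} pass rate {pass_rate:.0%} is below {ALERT_THRESHOLD:.0%}")

record_run("v0.5 pre-deploy", passed=27, total=30)
```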
Best Practices
Coverage Strategies
Start with Happy Paths
Test the ideal user journey first, then expand to edge cases as coverage grows.
Regression Test Creation
When to Add Regression Tests
Add a test whenever:
- User reports incorrect information
- Escalation should have happened but didn’t
- Response tone was inappropriate
- Tool was used incorrectly
- Edge case wasn’t handled
Maintaining Test Suites
Regular Maintenance Tasks
Monthly: Review and Update
- Remove obsolete tests (deprecated features)
- Update expected values for intentional changes
- Add tests for new features
- Adjust thresholds based on performance
After major changes:
- Guidance overhaul → Update all tone/style tests
- Knowledge migration → Update semantic equivalence tests
- Tool changes → Update tool usage tests
- Audience changes → Add segment-specific tests
Remove tests that:
- Fail inconsistently without clear pattern
- Test implementation details instead of user value
- Duplicate coverage of other tests
- Apply to removed features
Test Data Management
Sensitive Information
Never include in test cases:
- Real customer names, emails, or PII
- Actual account numbers or IDs
- Confidential business information
- Internal system details
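Placeholder identities keep test conversations realistic without exposing anyone. A small sketch of the kind of synthetic values to reach for instead of real records - every value here is made up:

```python
# Synthetic stand-ins for test conversations - never real customer data.
SYNTHETIC_CUSTOMER = {
    "name": "Jamie Example",
    "email": "jamie.example@example.com",  # example.com is reserved for documentation
    "account_id": "ACC-000000",            # obviously fake identifier format
    "plan": "Pro",
}

user_message = (
    f"Hi, I'm {SYNTHETIC_CUSTOMER['name']} on the {SYNTHETIC_CUSTOMER['plan']} plan. "
    "Can you explain the charge on my latest invoice?"
)
```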
Common Testing Patterns
Knowledge Accuracy Suite
Goal: Verify factual information is correct
Check Type: Semantically Equivalent (80% threshold)
When to Run: After any knowledge updates, before deployment
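Concretely, such a suite is a handful of question/expected-answer pairs sharing the same check. An illustrative sketch - the facts shown are placeholders, not real product details:

```python
knowledge_accuracy_suite = {
    "name": "Knowledge Accuracy - Core Facts",
    "check": {"type": "semantically_equivalent", "threshold": 0.8},  # 80% threshold
    "cases": [
        {
            "question": "How long is the free trial?",
            "expected": "The free trial lasts 14 days.",  # placeholder fact
        },
        {
            "question": "Which plans include priority support?",
            "expected": "Priority support is included on the Enterprise plan.",  # placeholder fact
        },
    ],
}
```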
Tone Consistency Suite
Goal: Ensure brand voice across scenarios
Check Type: Classify As (tone classifications)
When to Run: After guidance changes, monthly quality check
Tool Usage Suite
Goal: Validate the AI uses tools appropriately
Check Type: Mix of Matches Pattern and Classify As
When to Run: After tool configuration changes
Multi-Turn Context Suite
Goal: Test context retention across a conversation
Check Type: Semantically Equivalent
When to Run: After model updates, monthly
Segment-Specific Suite
Goal: Validate audience-targeted behavior
Check Type: Mix of Semantically Equivalent and Classify As
When to Run: After audience/guidance changes
Frequently Asked Questions
Inconsistent Results
Problem: The same test passes sometimes and fails other times.
Common Causes:
- Threshold too close to borderline (e.g., 75% when responses vary 70-80%)
- AI response has acceptable variation in wording
- Check criteria too strict for creative responses
Unexpected Passes
Problem: A test passes, but manual review shows poor quality.
Common Causes:
- Expected value too vague
- Threshold too low
- Check type doesn’t match what you’re validating
All Tests Failing
Problem: The entire suite suddenly fails.
Common Causes:
- Wrong AI version deployed
- Guidance deactivated
- Required tools disabled
- Knowledge source removed
False Negatives
Problem: A test fails, but the response is actually good.
Common Causes:
- Expected value doesn’t match how AI naturally phrases things
- Threshold too high
- Pattern too specific
Measuring Test Suite Health
Track these metrics to ensure your testing remains effective:
Coverage Metrics
- Critical path coverage - do your most important user journeys each have at least one test?
Quality Metrics
- Pass rate trend - is the suite's pass rate improving or degrading across runs?
Next Steps
Now that you understand AI testing:
- Improve Answers - Use test failures to guide improvements
- Instruct AI Agent - Refine guidance to pass more tests
- Add Knowledge - Fill gaps revealed by failing tests
- Deploy Changes - Ship with confidence after tests pass