Monitoring, Evaluation, and Improvement

Monitor AI agent performance in PromptOwl with evaluation sets, AI Judge scoring, annotations, and automated prompt improvement workflows.

This guide explains how to monitor your agents' performance in PromptOwl, collect and analyze feedback, run systematic evaluations, and use AI to continuously improve your prompts.


The Improvement Lifecycle

Continuous improvement follows a cyclical process:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│    │  DEPLOY  │────▶│  MONITOR │────▶│ COLLECT  │          │
│    │  Agent   │     │  Usage   │     │ Feedback │          │
│    └──────────┘     └──────────┘     └──────────┘          │
│         ▲                                  │               │
│         │                                  ▼               │
│    ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│    │  PUBLISH │◀────│ IMPROVE  │◀────│ EVALUATE │          │
│    │  Update  │     │  Prompt  │     │  Quality │          │
│    └──────────┘     └──────────┘     └──────────┘          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Six Steps

| Step | Action | Purpose |
|------|--------|---------|
| 1. Deploy | Publish your agent to production | Make available to users |
| 2. Monitor | Track usage, tokens, conversations | Understand real-world behavior |
| 3. Collect | Gather annotations and feedback | Identify improvement areas |
| 4. Evaluate | Test systematically with eval sets | Measure quality objectively |
| 5. Improve | Use AI to refine prompts | Generate better versions |
| 6. Publish | Release improvements | Complete the cycle |


Collecting Quality Feedback

Types of Feedback

PromptOwl supports two levels of feedback:

1. Message-Level Annotations

Feedback on individual AI responses:

  • When to use: Rating specific answers

  • Components: Sentiment (thumbs up/down) + detailed text

  • Best for: Identifying specific issues

2. Conversation-Level Annotations

Feedback on entire conversations:

  • When to use: Rating overall experience

  • Components: Sentiment + summary feedback

  • Best for: Holistic assessment

Encouraging Quality Feedback

For Internal Teams

Train your team to provide actionable annotations:

Good annotation:

"The response was accurate but too technical for our customer audience. Should use simpler language and avoid jargon like 'API endpoint'."

Poor annotation:

"Bad response"

Annotation Guidelines

| Do | Don't |
|----|-------|
| Be specific about what's wrong | Use vague terms like "bad" or "wrong" |
| Suggest how to improve | Just criticize without direction |
| Note what worked well | Only focus on negatives |
| Include context if relevant | Assume the reader knows the situation |

Screenshot: Annotation Modal

Sentiment Best Practices

Use sentiment consistently:

| Sentiment | When to Use |
|-----------|-------------|
| 👍 Positive | Response was helpful, accurate, appropriate |
| 👎 Negative | Response was wrong, unhelpful, inappropriate |
| — Neutral | Response was acceptable but could be better |


Monitoring Performance

Accessing the Monitor

  1. Open your prompt from the Dashboard

  2. Click Monitor in the top navigation

  3. View the conversation history and analytics

Screenshot: Monitor Navigation

Monitor Interface Overview

The Monitor has two main tabs:

History Tab

Shows all conversations with your agent:

| Column | Description |
|--------|-------------|
| User | Who had the conversation |
| Start Time | When it began |
| Duration | How long it lasted |
| Source | Where the conversation came from |
| Topic | Conversation subject |

Screenshot: History Tab

All Annotations Tab

Aggregates all feedback:

| Column | Description |
|--------|-------------|
| User | Who provided feedback |
| Question | What was asked |
| Response | What the AI answered |
| Annotation | The feedback text |
| Sentiment | Thumbs up/down/neutral |
| Date | When feedback was given |

Screenshot: All Annotations Tab

Key Metrics to Track

Conversation Metrics

  • Total Conversations: Overall usage volume

  • Average Duration: How long users engage

  • Messages per Conversation: Conversation depth

Quality Metrics

  • Satisfaction Score: Average sentiment (0-100)

  • Positive Rate: Percentage of thumbs up

  • Annotation Volume: How much feedback collected

Usage Metrics

  • Total Tokens Used: API consumption

  • Model Distribution: Which models are used

  • Peak Usage Times: When most active

Filtering Conversations

Find specific conversations:

  1. Search: Type to filter by user, topic, or content

  2. Date Range: Focus on specific time periods

  3. Eval Filter: Show only test conversations

Screenshot: Monitor Filters

Viewing Conversation Details

Click any conversation to see:

  • Complete message history

  • User questions and AI responses

  • Block outputs (for sequential/supervisor)

  • Citations shown

  • Annotations provided

Screenshot: Conversation Detail

Creating Evaluation Sets

Evaluation sets (Eval Sets) are collections of test cases used to systematically measure prompt quality.

What's in an Eval Set?

Each eval set contains:

| Component | Description | Required |
|-----------|-------------|----------|
| Name | Descriptive identifier | Yes |
| Input | Test question/query | Yes |
| Expected Output | Desired response (annotation) | Recommended |

Method 1: Create from Annotations

Convert quality feedback into test cases:

  1. Go to Monitor → All Annotations tab

  2. Check boxes next to high-quality annotations

  3. Click Save to Eval Set

  4. Name your eval set

  5. Click Create

Screenshot: Save to Eval Set

Best annotations for eval sets:

  • Clear positive examples (what good looks like)

  • Clear negative examples (what to avoid)

  • Edge cases users actually encountered

  • Common question patterns

Method 2: Upload CSV

Import test cases from a spreadsheet:

  1. Go to Eval tab on your prompt

  2. Click Upload CSV

  3. Select your file with columns:

    • question or input - The test query

    • result or annotation - Expected response

  4. Name the eval set

  5. Click Upload

Screenshot: CSV Upload

CSV Format Example:
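
A minimal file needs only the two columns described above. The rows below are illustrative placeholders; replace them with your own questions and expected answers:

```csv
question,annotation
"How do I reset my password?","Walk through the reset flow in 2-3 short steps and link to the relevant help article."
"Which plans do you offer?","List the current plan tiers with a one-sentence description of each."
```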

Method 3: Manual Entry

Add test cases one at a time:

  1. Go to Eval tab

  2. Click Add Test Case

  3. Enter the input question

  4. Enter the expected output

  5. Save

Organizing Eval Sets

Create multiple eval sets for different purposes:

| Eval Set Name | Purpose |
|---------------|---------|
| "Core FAQs" | Basic functionality testing |
| "Edge Cases" | Unusual or difficult queries |
| "Regression Tests" | Ensure fixes don't break existing behavior |
| "New Feature Tests" | Validate new capabilities |


Running Evaluations

The Evaluation Process

When you run an evaluation, PromptOwl sends each test case's input to the selected prompt version and records the actual response alongside the expected output so you can review them side by side.

Running an Evaluation

  1. Go to your prompt's Eval tab

  2. Select an Eval Set from the dropdown

  3. Select the Version to test

  4. Click Run Eval

  5. Wait for all test cases to complete

  6. Review results in the table

Screenshot: Run Eval Button
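
Under the hood, a run is essentially a loop over the eval set: each test input goes to the selected version and the response is stored next to the expected output. A minimal sketch of that idea in Python; `run_prompt` is a hypothetical stand-in for however you invoke the agent, not a PromptOwl API:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    input_text: str
    expected: str
    response: str
    version: str

def run_eval(eval_cases, version, run_prompt):
    """Run every test case against one prompt version.

    run_prompt(input_text, version) is a hypothetical callable that
    returns the agent's response; swap in your own invocation method.
    """
    results = []
    for case in eval_cases:  # each case: {"input": ..., "expected": ...}
        response = run_prompt(case["input"], version)
        results.append(EvalResult(case["input"], case["expected"], response, version))
    return results
```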

Understanding Results

After running, you'll see:

| Column | Description |
|--------|-------------|
| Input | The test question |
| Expected Output | What you wanted |
| Eval Response | What the AI actually said |
| Version | Which version was tested |

Screenshot: Eval Results

Comparing Versions

Test multiple versions to find the best:

  1. Run eval with Version 1

  2. Make prompt improvements → Create Version 2

  3. Run eval with Version 2

  4. Compare results in the Runs tab

The Runs tab shows historical comparisons:

| Eval Set | Version | Date | Count | Avg Score | Pass Rate |
|----------|---------|------|-------|-----------|-----------|
| Core FAQs | v1 | Dec 28 | 25 | 3.2 | 60% |
| Core FAQs | v2 | Dec 29 | 25 | 4.1 | 84% |
| Core FAQs | v3 | Dec 30 | 25 | 4.5 | 92% |

Screenshot: Runs Tab


Using AI Judge

AI Judge automatically evaluates response quality using another AI model.

What AI Judge Does

Instead of manually reviewing every response, AI Judge:

  1. Reads the input question

  2. Reads the expected output

  3. Reads the actual response

  4. Scores quality (1-5 scale)

  5. Explains its reasoning
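
That flow can be sketched in a few lines: build a judging prompt from the three pieces of context, ask a model for a 1 to 5 score with a short rationale, and parse the score out of the reply. This is a conceptual sketch, not PromptOwl's judge implementation; `call_model` is a hypothetical placeholder for your own model call:

```python
import re

JUDGE_TEMPLATE = """You are grading an AI response.
Question: {question}
Expected output: {expected}
Actual response: {actual}

Rate the actual response from 1 (poor) to 5 (excellent) for accuracy,
completeness, and quality. Reply as: Score: <n> Reason: <one sentence>"""

def judge(question, expected, actual, call_model):
    """Score one eval result with an LLM judge.

    call_model(prompt_text) is a hypothetical callable that returns the
    judge model's text reply.
    """
    reply = call_model(JUDGE_TEMPLATE.format(
        question=question, expected=expected, actual=actual))
    match = re.search(r"Score:\s*([1-5])", reply)
    score = int(match.group(1)) if match else None  # None if the reply is malformed
    return score, reply
```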

Judge Evaluation Criteria

The AI Judge evaluates based on:

| Criterion | Description |
|-----------|-------------|
| Accuracy | Does the response match the expected output? |
| Completeness | Does it cover all necessary points? |
| Quality | Is it clear, helpful, and well-structured? |

Scoring Scale

| Score | Meaning | Description |
|-------|---------|-------------|
| 5 | Excellent | Exceeds expectations |
| 4 | Good | Meets expectations well |
| 3 | Average | Acceptable, room for improvement |
| 2 | Below Average | Missing key elements |
| 1 | Poor | Significantly wrong or unhelpful |

Running Judge Evaluation

  1. First, run a regular evaluation

  2. Click Run Judge

  3. Wait for judge to score all results

  4. Review scores and reasoning

Screenshot: Run Judge

Understanding Judge Output

After judging, you'll see:

| Column | Description |
|--------|-------------|
| Judge | The AI's reasoning for the score |
| Score | Numeric rating (1-5) |

Example Judge Output:

A judged row pairs the numeric score with a short rationale, e.g. "Score: 4. Accurate and covers the expected points, but noticeably more verbose than the expected output."

Aggregate Metrics

Judge evaluation calculates:

  • Average Score: Mean of all test case scores

  • Pass Rate: Percentage scoring 3 or above

These appear in the Runs tab for easy comparison across versions.
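
As a quick illustration of how these two numbers fall out of a list of judge scores (the formulas match the Key Metrics Glossary at the end of this guide):

```python
def aggregate(scores):
    """Average score and pass rate from a list of 1-5 judge scores."""
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 3) / len(scores)  # 3 or above counts as a pass
    return avg_score, pass_rate

# Ten judged test cases (illustrative values)
avg, rate = aggregate([5, 4, 4, 3, 2, 5, 4, 3, 1, 4])
print(f"Avg Score: {avg:.1f}  Pass Rate: {rate:.0%}")  # Avg Score: 3.5  Pass Rate: 80%
```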


AI-Assisted Improvement

PromptOwl can use AI to suggest improvements based on real conversations and feedback.

Accessing Improve with AI

  1. Open your prompt in edit mode

  2. Click Improve with AI button

  3. The improvement dialog opens

Screenshot: Improve with AI

Improvement Sources

The AI can analyze:

1. Example Conversations

Paste or auto-load real conversations that show the behavior you want to change.

2. User Feedback

Your specific improvement requests, written in plain language.

3. Annotations

Auto-loaded feedback from production:

  • Message-level annotations

  • Conversation-level notes

  • Sentiment data

How Improvement Works

The AI reviews the sources you provide alongside your current prompt and generates suggested revisions, which you can review before applying them as new variations (see Applying Improvements below).

Prompt-Type Specific Improvements

AI understands your prompt type and optimizes accordingly:

Simple Prompts

Focus areas:

  • Clarity of instructions

  • Response format consistency

  • Tone and personality

  • Edge case handling

Sequential Prompts

Focus areas:

  • Block efficiency and necessity

  • Data flow between blocks

  • Inter-step coordination

  • Output quality at each stage

Supervisor Prompts

Focus areas:

  • Task routing logic

  • Agent specialization clarity

  • Delegation decisions

  • Response synthesis

Applying Improvements

Improvements are applied non-destructively:

  1. Block Names: Updated if AI suggests better names

  2. Content: Saved as new variations (original preserved)

  3. Testing: A/B test original vs improved

Screenshot: Improvement Suggestions

Improvement-to-Evaluation Loop

After applying improvements:

  1. New variations are created

  2. Run evaluation against the new version

  3. Compare scores with previous version

  4. If improved, publish; if not, iterate
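
The publish-or-iterate decision at the end of the loop can be as simple as comparing the two runs' average judge scores. A sketch, with an illustrative minimum gain that is not a PromptOwl default:

```python
def should_publish(baseline_scores, candidate_scores, min_gain=0.1):
    """Return True if the new version's average judge score beats the
    baseline by at least min_gain (illustrative threshold)."""
    baseline_avg = sum(baseline_scores) / len(baseline_scores)
    candidate_avg = sum(candidate_scores) / len(candidate_scores)
    return candidate_avg >= baseline_avg + min_gain
```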


Best Practices

Annotation Best Practices

Collecting Effective Feedback

| Practice | Why It Matters |
|----------|----------------|
| Annotate immediately | Details are fresh |
| Be specific | Vague feedback isn't actionable |
| Include examples | Show what you expected |
| Note patterns | Same issue multiple times = priority |

Building Annotation Culture

  • Train team on annotation guidelines

  • Celebrate improvements from feedback

  • Share "before/after" examples

  • Make annotation part of workflow

Monitoring Best Practices

What to Monitor Daily

  • Conversation volume trends

  • Negative sentiment spikes

  • Unusual error patterns

  • Token usage anomalies

What to Review Weekly

  • Aggregate satisfaction scores

  • Common failure patterns

  • Top annotation themes

  • Version performance comparison

Setting Up Alerts

Consider monitoring for:

  • Satisfaction score drops below threshold

  • Unusual volume changes

  • High negative annotation rates
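
PromptOwl supplies these numbers; the alerting itself can live in whatever tooling you already use. A sketch of a daily threshold check, assuming you track positive and negative annotation counts from the All Annotations tab yourself; the thresholds are illustrative defaults:

```python
def check_satisfaction(positive, negative, min_satisfaction=0.85, min_feedback=10):
    """Flag a satisfaction drop using the glossary formula
    positive / (positive + negative). Thresholds are illustrative."""
    total = positive + negative
    if total < min_feedback:
        return None  # too little feedback to judge
    satisfaction = positive / total
    if satisfaction < min_satisfaction:
        return f"ALERT: satisfaction {satisfaction:.0%} is below {min_satisfaction:.0%}"
    return None
```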

Evaluation Best Practices

Building Quality Eval Sets

| Guideline | Implementation |
|-----------|----------------|
| Cover common cases | 70% should be typical questions |
| Include edge cases | 20% should test boundaries |
| Add failure scenarios | 10% should test error handling |
| Update regularly | Add new cases from production |

Running Effective Evaluations

  • Baseline first: Eval current version before changes

  • One variable: Change one thing, then eval

  • Statistical significance: Use enough test cases (25+)

  • Judge consistently: Use same judge prompt

Interpreting Results

| Metric | Target | Action if Below |
|--------|--------|-----------------|
| Pass Rate | >80% | Review failing cases |
| Avg Score | >4.0 | Focus on low scorers |
| Consistency | Low variance | Investigate outliers |

Improvement Best Practices

When to Use AI Improvement

  • After identifying patterns in annotations

  • When manual tweaking isn't working

  • To get fresh perspective on prompt

  • When scaling improvement efforts

What to Include in Feedback

Helpful feedback:

"Responses are too long for chat. Keep answers to three sentences or fewer and end with a link to the relevant help article."

Not helpful:

"Make it better"

Validating Improvements

  1. Don't trust blindly: AI suggestions need human review

  2. Test systematically: Run eval before publishing

  3. Start with one change: Don't apply all suggestions at once

  4. Monitor after publish: Watch for regressions

Continuous Improvement Workflow

Daily

Review new conversations and annotations in the Monitor; watch for negative sentiment spikes or unusual errors.

Weekly

Review aggregate satisfaction scores and annotation themes; add strong examples from production to your eval sets.

Monthly

Run Improve with AI on the accumulated feedback, evaluate the new version against your baseline, and publish if it scores higher.


Troubleshooting

Annotations not appearing

  1. Check annotation feature is enabled (Enterprise Settings)

  2. Verify user has permission to annotate

  3. Refresh the Monitor view

  4. Check correct prompt is selected

Evaluation failing

  1. Verify eval set has test cases

  2. Check prompt version exists

  3. Ensure API keys are configured

  4. Try running single test case first

Judge scores seem wrong

  1. Review judge prompt (uses production version)

  2. Check expected outputs are realistic

  3. Verify input/output alignment in eval set

  4. Consider adjusting pass threshold

AI improvement suggestions unhelpful

  1. Provide more specific feedback

  2. Include actual conversation examples

  3. Describe what good looks like

  4. Try with different conversation samples

Metrics not updating

  1. Refresh the page

  2. Check date range filters

  3. Verify conversations are completed

  4. Allow time for aggregation


Quick Reference

Keyboard Shortcuts

| Action | Shortcut |
|--------|----------|
| Refresh Monitor | Ctrl/Cmd + R |
| Search | Ctrl/Cmd + F |
| Save annotation | Enter (in modal) |

Key Metrics Glossary

| Metric | Formula | Good Target |
|--------|---------|-------------|
| Pass Rate | (Scores ≥ 3) / Total | >80% |
| Avg Score | Sum(Scores) / Count | >4.0 |
| Satisfaction | Positive / (Positive + Negative) | >85% |

Eval Set Size Guidelines

| Purpose | Recommended Size |
|---------|------------------|
| Quick check | 10-15 cases |
| Standard eval | 25-50 cases |
| Comprehensive | 100+ cases |

