Monitoring, Evaluation, and Improvement

Monitor AI agent performance in PromptOwl with evaluation sets, AI Judge scoring, annotations, and automated prompt improvement workflows.

This guide explains how to monitor your agents' performance in PromptOwl, collect and analyze feedback, run systematic evaluations, and use AI to continuously improve your prompts.


The Improvement Lifecycle

Continuous improvement follows a cyclical process:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│    │  DEPLOY  │────▶│  MONITOR │────▶│ COLLECT  │          │
│    │  Agent   │     │  Usage   │     │ Feedback │          │
│    └──────────┘     └──────────┘     └──────────┘          │
│         ▲                                  │               │
│         │                                  ▼               │
│    ┌──────────┐     ┌──────────┐     ┌──────────┐          │
│    │  PUBLISH │◀────│ IMPROVE  │◀────│ EVALUATE │          │
│    │  Update  │     │  Prompt  │     │  Quality │          │
│    └──────────┘     └──────────┘     └──────────┘          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Six Steps

| Step | Action | Purpose |
|------|--------|---------|
| 1. Deploy | Publish your agent to production | Make available to users |
| 2. Monitor | Track usage, tokens, conversations | Understand real-world behavior |
| 3. Collect | Gather annotations and feedback | Identify improvement areas |
| 4. Evaluate | Test systematically with eval sets | Measure quality objectively |
| 5. Improve | Use AI to refine prompts | Generate better versions |
| 6. Publish | Release improvements | Complete the cycle |


Collecting Quality Feedback

Types of Feedback

PromptOwl supports two levels of feedback:

1. Message-Level Annotations

Feedback on individual AI responses:

  • When to use: Rating specific answers

  • Components: Sentiment (thumbs up/down) + detailed text

  • Best for: Identifying specific issues

2. Conversation-Level Annotations

Feedback on entire conversations:

  • When to use: Rating overall experience

  • Components: Sentiment + summary feedback

  • Best for: Holistic assessment

Encouraging Quality Feedback

For Internal Teams

Train your team to provide actionable annotations:

Good annotation:

"The response was accurate but too technical for our customer audience. Should use simpler language and avoid jargon like 'API endpoint'."

Poor annotation:

"Bad response"

Annotation Guidelines

| Do | Don't |
|----|-------|
| Be specific about what's wrong | Use vague terms like "bad" or "wrong" |
| Suggest how to improve | Just criticize without direction |
| Note what worked well | Only focus on negatives |
| Include context if relevant | Assume the reader knows the situation |

Screenshot: Annotation Modal

Sentiment Best Practices

Use sentiment consistently:

| Sentiment | When to Use |
|-----------|-------------|
| 👍 Positive | Response was helpful, accurate, appropriate |
| 👎 Negative | Response was wrong, unhelpful, inappropriate |
| — Neutral | Response was acceptable but could be better |


Monitoring Performance

Accessing the Monitor

  1. Open your prompt from the Dashboard

  2. Click Monitor in the top navigation

  3. View the conversation history and analytics

Screenshot: Monitor Navigation

Monitor Interface Overview

The Monitor has two main tabs:

History Tab

Shows all conversations with your agent:

| Column | Description |
|--------|-------------|
| User | Who had the conversation |
| Start Time | When it began |
| Duration | How long it lasted |
| Source | Where the conversation came from |
| Topic | Conversation subject |

Screenshot: History Tab

All Annotations Tab

Aggregates all feedback:

| Column | Description |
|--------|-------------|
| User | Who provided feedback |
| Question | What was asked |
| Response | What the AI answered |
| Annotation | The feedback text |
| Sentiment | Thumbs up/down/neutral |
| Date | When feedback was given |

Screenshot: All Annotations Tab

Key Metrics to Track

Conversation Metrics

  • Total Conversations: Overall usage volume

  • Average Duration: How long users engage

  • Messages per Conversation: Conversation depth

Quality Metrics

  • Satisfaction Score: Average sentiment (0-100)

  • Positive Rate: Percentage of thumbs up

  • Annotation Volume: How much feedback collected

Usage Metrics

  • Total Tokens Used: API consumption

  • Model Distribution: Which models are used

  • Peak Usage Times: When most active

Filtering Conversations

Find specific conversations:

  1. Search: Type to filter by user, topic, or content

  2. Date Range: Focus on specific time periods

  3. Eval Filter: Show only test conversations

Screenshot: Monitor Filters

Viewing Conversation Details

Click any conversation to see:

  • Complete message history

  • User questions and AI responses

  • Block outputs (for sequential/supervisor)

  • Citations shown

  • Annotations provided

Screenshot: Conversation Detail

Creating Evaluation Sets

Evaluation sets (Eval Sets) are collections of test cases used to systematically measure prompt quality.

What's in an Eval Set?

Each eval set contains:

| Component | Description | Required |
|-----------|-------------|----------|
| Name | Descriptive identifier | Yes |
| Input | Test question/query | Yes |
| Expected Output | Desired response (annotation) | Recommended |

Method 1: Create from Annotations

Convert quality feedback into test cases:

  1. Go to Monitor → All Annotations tab

  2. Check boxes next to high-quality annotations

  3. Click Save to Eval Set

  4. Name your eval set

  5. Click Create

Screenshot: Save to Eval Set

Best annotations for eval sets:

  • Clear positive examples (what good looks like)

  • Clear negative examples (what to avoid)

  • Edge cases users actually encountered

  • Common question patterns

Method 2: Upload CSV

Import test cases from a spreadsheet:

  1. Go to Eval tab on your prompt

  2. Click Upload CSV

  3. Select your file with columns:

    • question or input - The test query

    • result or annotation - Expected response

  4. Name the eval set

  5. Click Upload

Screenshot: CSV Upload

CSV Format Example:
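
A minimal file needs only the two columns described above. The rows below are illustrative placeholders; replace them with your own questions and expected answers:

```csv
question,annotation
"How do I reset my password?","Walk through the reset flow in 2-3 short steps and link to the relevant help article."
"Which plans do you offer?","List the current plan tiers with a one-sentence description of each."
```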

Method 3: Manual Entry

Add test cases one at a time:

  1. Go to Eval tab

  2. Click Add Test Case

  3. Enter the input question

  4. Enter the expected output

  5. Save

Organizing Eval Sets

Create multiple eval sets for different purposes:

| Eval Set Name | Purpose |
|---------------|---------|
| "Core FAQs" | Basic functionality testing |
| "Edge Cases" | Unusual or difficult queries |
| "Regression Tests" | Ensure fixes don't break existing behavior |
| "New Feature Tests" | Validate new capabilities |


Running Evaluations

The Evaluation Process

When you run an evaluation, PromptOwl sends each test case's input to the selected prompt version and records the actual response alongside the expected output so you can review them side by side.

Running an Evaluation

  1. Go to your prompt's Eval tab

  2. Select an Eval Set from the dropdown

  3. Select the Version to test

  4. Click Run Eval

  5. Wait for all test cases to complete

  6. Review results in the table

Screenshot: Run Eval Button
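
Under the hood, a run is essentially a loop over the eval set: each test input goes to the selected version and the response is stored next to the expected output. A minimal sketch of that idea in Python; `run_prompt` is a hypothetical stand-in for however you invoke the agent, not a PromptOwl API:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    input_text: str
    expected: str
    response: str
    version: str

def run_eval(eval_cases, version, run_prompt):
    """Run every test case against one prompt version.

    run_prompt(input_text, version) is a hypothetical callable that
    returns the agent's response; swap in your own invocation method.
    """
    results = []
    for case in eval_cases:  # each case: {"input": ..., "expected": ...}
        response = run_prompt(case["input"], version)
        results.append(EvalResult(case["input"], case["expected"], response, version))
    return results
```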

Understanding Results

After running, you'll see:

| Column | Description |
|--------|-------------|
| Input | The test question |
| Expected Output | What you wanted |
| Eval Response | What the AI actually said |
| Version | Which version was tested |

Screenshot: Eval Results

Comparing Versions

Test multiple versions to find the best:

  1. Run eval with Version 1

  2. Make prompt improvements → Create Version 2

  3. Run eval with Version 2

  4. Compare results in the Runs tab

The Runs tab shows historical comparisons:

| Eval Set | Version | Date | Count | Avg Score | Pass Rate |
|----------|---------|------|-------|-----------|-----------|
| Core FAQs | v1 | Dec 28 | 25 | 3.2 | 60% |
| Core FAQs | v2 | Dec 29 | 25 | 4.1 | 84% |
| Core FAQs | v3 | Dec 30 | 25 | 4.5 | 92% |

Screenshot: Runs Tab


Using AI Judge

AI Judge automatically evaluates response quality using another AI model.

What AI Judge Does

Instead of manually reviewing every response, AI Judge:

  1. Reads the input question

  2. Reads the expected output

  3. Reads the actual response

  4. Scores quality (1-5 scale)

  5. Explains its reasoning
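
That flow can be sketched in a few lines: build a judging prompt from the three pieces of context, ask a model for a 1 to 5 score with a short rationale, and parse the score out of the reply. This is a conceptual sketch, not PromptOwl's judge implementation; `call_model` is a hypothetical placeholder for your own model call:

```python
import re

JUDGE_TEMPLATE = """You are grading an AI response.
Question: {question}
Expected output: {expected}
Actual response: {actual}

Rate the actual response from 1 (poor) to 5 (excellent) for accuracy,
completeness, and quality. Reply as: Score: <n> Reason: <one sentence>"""

def judge(question, expected, actual, call_model):
    """Score one eval result with an LLM judge.

    call_model(prompt_text) is a hypothetical callable that returns the
    judge model's text reply.
    """
    reply = call_model(JUDGE_TEMPLATE.format(
        question=question, expected=expected, actual=actual))
    match = re.search(r"Score:\s*([1-5])", reply)
    score = int(match.group(1)) if match else None  # None if the reply is malformed
    return score, reply
```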

Judge Evaluation Criteria

The AI Judge evaluates based on:

| Criterion | Description |
|-----------|-------------|
| Accuracy | Does the response match the expected output? |
| Completeness | Does it cover all necessary points? |
| Quality | Is it clear, helpful, and well-structured? |

Scoring Scale

| Score | Meaning | Description |
|-------|---------|-------------|
| 5 | Excellent | Exceeds expectations |
| 4 | Good | Meets expectations well |
| 3 | Average | Acceptable, room for improvement |
| 2 | Below Average | Missing key elements |
| 1 | Poor | Significantly wrong or unhelpful |

Running Judge Evaluation

  1. First, run a regular evaluation

  2. Click Run Judge

  3. Wait for judge to score all results

  4. Review scores and reasoning

Screenshot: Run Judge

Understanding Judge Output

After judging, you'll see:

| Column | Description |
|--------|-------------|
| Judge | The AI's reasoning for the score |
| Score | Numeric rating (1-5) |

Example Judge Output:

A judged row pairs the numeric score with a short rationale, e.g. "Score: 4. Accurate and covers the expected points, but noticeably more verbose than the expected output."

Aggregate Metrics

Judge evaluation calculates:

  • Average Score: Mean of all test case scores

  • Pass Rate: Percentage scoring 3 or above

These appear in the Runs tab for easy comparison across versions.
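
As a quick illustration of how these two numbers fall out of a list of judge scores (the formulas match the Key Metrics Glossary at the end of this guide):

```python
def aggregate(scores):
    """Average score and pass rate from a list of 1-5 judge scores."""
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 3) / len(scores)  # 3 or above counts as a pass
    return avg_score, pass_rate

# Ten judged test cases (illustrative values)
avg, rate = aggregate([5, 4, 4, 3, 2, 5, 4, 3, 1, 4])
print(f"Avg Score: {avg:.1f}  Pass Rate: {rate:.0%}")  # Avg Score: 3.5  Pass Rate: 80%
```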


AI-Assisted Improvement

PromptOwl can use AI to suggest improvements based on real conversations and feedback.

Accessing Improve with AI

  1. Open your prompt in edit mode

  2. Click Improve with AI button

  3. The improvement dialog opens

Screenshot: Improve with AI

Improvement Sources

The AI can analyze:

1. Example Conversations

Paste or auto-load real conversations that show the behavior you want to change.

2. User Feedback

Your specific improvement requests, written in plain language.

3. Annotations

Auto-loaded feedback from production:

  • Message-level annotations

  • Conversation-level notes

  • Sentiment data

How Improvement Works

The AI reviews the sources you provide alongside your current prompt and generates suggested revisions, which you can review before applying them as new variations (see Applying Improvements below).

Prompt-Type Specific Improvements

AI understands your prompt type and optimizes accordingly:

Simple Prompts

Focus areas:

  • Clarity of instructions

  • Response format consistency

  • Tone and personality

  • Edge case handling

Sequential Prompts

Focus areas:

  • Block efficiency and necessity

  • Data flow between blocks

  • Inter-step coordination

  • Output quality at each stage

Supervisor Prompts

Focus areas:

  • Task routing logic

  • Agent specialization clarity

  • Delegation decisions

  • Response synthesis

Applying Improvements

Improvements are applied non-destructively:

  1. Block Names: Updated if AI suggests better names

  2. Content: Saved as new variations (original preserved)

  3. Testing: A/B test original vs improved

Screenshot: Improvement Suggestions

Improvement-to-Evaluation Loop

After applying improvements:

  1. New variations are created

  2. Run evaluation against the new version

  3. Compare scores with previous version

  4. If improved, publish; if not, iterate
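
The publish-or-iterate decision at the end of the loop can be as simple as comparing the two runs' average judge scores. A sketch, with an illustrative minimum gain that is not a PromptOwl default:

```python
def should_publish(baseline_scores, candidate_scores, min_gain=0.1):
    """Return True if the new version's average judge score beats the
    baseline by at least min_gain (illustrative threshold)."""
    baseline_avg = sum(baseline_scores) / len(baseline_scores)
    candidate_avg = sum(candidate_scores) / len(candidate_scores)
    return candidate_avg >= baseline_avg + min_gain
```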


Best Practices

Annotation Best Practices

Collecting Effective Feedback

| Practice | Why It Matters |
|----------|----------------|
| Annotate immediately | Details are fresh |
| Be specific | Vague feedback isn't actionable |
| Include examples | Show what you expected |
| Note patterns | Same issue multiple times = priority |

Building Annotation Culture

  • Train team on annotation guidelines

  • Celebrate improvements from feedback

  • Share "before/after" examples

  • Make annotation part of workflow

Monitoring Best Practices

What to Monitor Daily

  • Conversation volume trends

  • Negative sentiment spikes

  • Unusual error patterns

  • Token usage anomalies

What to Review Weekly

  • Aggregate satisfaction scores

  • Common failure patterns

  • Top annotation themes

  • Version performance comparison

Setting Up Alerts

Consider monitoring for:

  • Satisfaction score drops below threshold

  • Unusual volume changes

  • High negative annotation rates
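
PromptOwl supplies these numbers; the alerting itself can live in whatever tooling you already use. A sketch of a daily threshold check, assuming you track positive and negative annotation counts from the All Annotations tab yourself; the thresholds are illustrative defaults:

```python
def check_satisfaction(positive, negative, min_satisfaction=0.85, min_feedback=10):
    """Flag a satisfaction drop using the glossary formula
    positive / (positive + negative). Thresholds are illustrative."""
    total = positive + negative
    if total < min_feedback:
        return None  # too little feedback to judge
    satisfaction = positive / total
    if satisfaction < min_satisfaction:
        return f"ALERT: satisfaction {satisfaction:.0%} is below {min_satisfaction:.0%}"
    return None
```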

Evaluation Best Practices

Building Quality Eval Sets

| Guideline | Implementation |
|-----------|----------------|
| Cover common cases | 70% should be typical questions |
| Include edge cases | 20% should test boundaries |
| Add failure scenarios | 10% should test error handling |
| Update regularly | Add new cases from production |

Running Effective Evaluations

  • Baseline first: Eval current version before changes

  • One variable: Change one thing, then eval

  • Statistical significance: Use enough test cases (25+)

  • Judge consistently: Use same judge prompt

Interpreting Results

| Metric | Target | Action if Below |
|--------|--------|-----------------|
| Pass Rate | >80% | Review failing cases |
| Avg Score | >4.0 | Focus on low scorers |
| Consistency | Low variance | Investigate outliers |

Improvement Best Practices

When to Use AI Improvement

  • After identifying patterns in annotations

  • When manual tweaking isn't working

  • To get fresh perspective on prompt

  • When scaling improvement efforts

What to Include in Feedback

Helpful feedback:

"Responses are too long for chat. Keep answers to three sentences or fewer and end with a link to the relevant help article."

Not helpful:

"Make it better"

Validating Improvements

  1. Don't trust blindly: AI suggestions need human review

  2. Test systematically: Run eval before publishing

  3. Start with one change: Don't apply all suggestions at once

  4. Monitor after publish: Watch for regressions

Continuous Improvement Workflow

Daily

Review new conversations and annotations in the Monitor; watch for negative sentiment spikes or unusual errors.

Weekly

Review aggregate satisfaction scores and annotation themes; add strong examples from production to your eval sets.

Monthly

Run Improve with AI on the accumulated feedback, evaluate the new version against your baseline, and publish if it scores higher.


Troubleshooting

Annotations not appearing

  1. Check annotation feature is enabled (Enterprise Settings)

  2. Verify user has permission to annotate

  3. Refresh the Monitor view

  4. Check correct prompt is selected

Evaluation failing

  1. Verify eval set has test cases

  2. Check prompt version exists

  3. Ensure API keys are configured

  4. Try running single test case first

Judge scores seem wrong

  1. Review judge prompt (uses production version)

  2. Check expected outputs are realistic

  3. Verify input/output alignment in eval set

  4. Consider adjusting pass threshold

AI improvement suggestions unhelpful

  1. Provide more specific feedback

  2. Include actual conversation examples

  3. Describe what good looks like

  4. Try with different conversation samples

Metrics not updating

  1. Refresh the page

  2. Check date range filters

  3. Verify conversations are completed

  4. Allow time for aggregation


Quick Reference

Keyboard Shortcuts

| Action | Shortcut |
|--------|----------|
| Refresh Monitor | Ctrl/Cmd + R |
| Search | Ctrl/Cmd + F |
| Save annotation | Enter (in modal) |

Key Metrics Glossary

| Metric | Formula | Good Target |
|--------|---------|-------------|
| Pass Rate | (Scores ≥ 3) / Total | >80% |
| Avg Score | Sum(Scores) / Count | >4.0 |
| Satisfaction | Positive / (Positive + Negative) | >85% |

Eval Set Size Guidelines

| Purpose | Recommended Size |
|---------|------------------|
| Quick check | 10-15 cases |
| Standard eval | 25-50 cases |
| Comprehensive | 100+ cases |

