Monitoring, Evaluation, and Improvement
Monitor AI agent performance in PromptOwl with evaluation sets, AI Judge scoring, annotations, and automated prompt improvement workflows.
The Improvement Lifecycle
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐            │
│   │  DEPLOY  │────▶│ MONITOR  │────▶│ COLLECT  │            │
│   │  Agent   │     │  Usage   │     │ Feedback │            │
│   └──────────┘     └──────────┘     └──────────┘            │
│        ▲                                 │                  │
│        │                                 ▼                  │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐            │
│   │ PUBLISH  │◀────│ IMPROVE  │◀────│ EVALUATE │            │
│   │  Update  │     │  Prompt  │     │ Quality  │            │
│   └──────────┘     └──────────┘     └──────────┘            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
The Six Steps
Step | Action | Purpose
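To make the loop concrete, the sketch below walks the six steps in code. The client object, its method names, and the 4.0 score threshold are illustrative assumptions, not PromptOwl's actual SDK; in practice most of these steps happen in the PromptOwl interface.

# Hypothetical sketch of one pass through the lifecycle. The client object,
# its method names, and the 4.0 quality bar are assumptions for illustration.
def improvement_cycle(client, agent_id: str, eval_set_id: str) -> None:
    conversations = client.fetch_conversations(agent_id)         # MONITOR usage
    print(f"Reviewing {len(conversations)} recent conversations")
    feedback = client.fetch_annotations(agent_id)                 # COLLECT feedback
    results = client.run_evaluation(agent_id, eval_set_id)        # EVALUATE quality
    if results["avg_score"] < 4.0:                                # assumed quality bar
        draft = client.improve_prompt(agent_id, feedback)         # IMPROVE prompt
        client.publish(agent_id, draft)                           # PUBLISH the update, then redeploy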
Collecting Quality Feedback
Types of Feedback
1. Message-Level Annotations
2. Conversation-Level Annotations
Encouraging Quality Feedback
For Internal Teams
Annotation Guidelines
Do | Don't

Sentiment Best Practices
Sentiment | When to Use
Monitoring Performance
Accessing the Monitor

Monitor Interface Overview
History Tab
Column | Description

All Annotations Tab
Column | Description

Key Metrics to Track
Conversation Metrics
Quality Metrics
Usage Metrics
Filtering and Search

Viewing Conversation Details

Creating Evaluation Sets
What's in an Eval Set?
Component | Description | Required
Method 1: Create from Annotations

Method 2: Upload CSV
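If your test cases start life in a spreadsheet, a short script can produce the file to upload. The column names below (input, expected_output, notes) are assumptions about the CSV layout; match them to the headers the upload dialog actually asks for.

import csv

# Assumed eval-set columns; adjust to the headers the CSV upload expects.
cases = [
    {"input": "How do I reset my password?",
     "expected_output": "Point the user to the password reset flow and confirm the email on file.",
     "notes": "High-volume support question"},
    {"input": "Cancel my subscription",
     "expected_output": "Confirm intent before explaining the cancellation steps.",
     "notes": "Must not cancel without confirmation"},
]

with open("eval_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output", "notes"])
    writer.writeheader()
    writer.writerows(cases)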

Method 3: Manual Entry
Organizing Eval Sets
Eval Set Name | Purpose
Running Evaluations
The Evaluation Process
Running an Evaluation

Understanding Results
Column | Description

Comparing Versions
Eval Set | Version | Date | Count | Avg Score | Pass Rate
Using AI Judge
What AI Judge Does
Judge Evaluation Criteria
Criterion | Description
Scoring Scale
Score | Meaning | Description
Running Judge Evaluation
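For background, AI Judge follows the common LLM-as-judge pattern: a second model scores each response against the expected output using a rubric. The sketch below shows the shape of such a call; the rubric wording, the 1-5 anchors, and the call_llm helper are assumptions for illustration, since PromptOwl runs the judge for you.

import json

# Assumed rubric text; PromptOwl's built-in judge applies its own criteria.
JUDGE_RUBRIC = (
    "Rate the assistant's answer from 1 (unacceptable) to 5 (excellent) "
    "against the expected output, considering accuracy, completeness, and tone. "
    'Reply with JSON: {"score": <1-5>, "reason": "<one sentence>"}'
)

def judge_case(call_llm, question: str, answer: str, expected: str) -> dict:
    # call_llm is a hypothetical helper that sends a prompt to a model
    # and returns its text reply.
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Expected output: {expected}\n"
        f"Assistant answer: {answer}\n"
    )
    return json.loads(call_llm(prompt))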
Understanding Judge Output
Column | Description
Aggregate Metrics
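Reading the aggregates is straightforward: the average score is the mean of the per-case judge scores, and the pass rate is the share of cases at or above a pass threshold. The threshold of 4 in the sketch below is an assumption; use whatever bar your team has agreed on.

def aggregate(scores: list[int], pass_threshold: int = 4) -> dict:
    # Summarize per-case judge scores (1-5) into run-level metrics.
    # The default threshold of 4 is an assumed pass bar, not a PromptOwl constant.
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
    return {"avg_score": round(avg_score, 2), "pass_rate": round(pass_rate, 2)}

print(aggregate([5, 4, 3, 5, 4]))  # {'avg_score': 4.2, 'pass_rate': 0.8}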
AI-Assisted Improvement
Accessing Improve with AI
Improvement Sources
1. Example Conversations
2. User Feedback
3. Annotations
How Improvement Works
Prompt-Type Specific Improvements
Simple Prompts
Sequential Prompts
Supervisor Prompts
Applying Improvements
Improvement-to-Evaluation Loop
Best Practices
Annotation Best Practices
Collecting Effective Feedback
Practice | Why It Matters
Building Annotation Culture
Monitoring Best Practices
What to Monitor Daily
What to Review Weekly
Setting Up Alerts
Evaluation Best Practices
Building Quality Eval Sets
Guideline | Implementation
Running Effective Evaluations
Interpreting Results
Metric | Target | Action if Below
Improvement Best Practices
When to Use AI Improvement
What to Include in Feedback
Validating Improvements
Continuous Improvement Workflow
Daily
Weekly
Monthly
Troubleshooting
Annotations not appearing
Evaluation failing
Judge scores seem wrong
AI improvement suggestions unhelpful
Metrics not updating
Quick Reference
Keyboard Shortcuts
Action | Shortcut
Key Metrics Glossary
Metric | Formula | Good Target
Eval Set Size Guidelines
Purpose | Recommended Size
Related Guides
Last updated



