
Weekly Tuning Session Checklist

30-minute weekly review template: queue patterns, prompt adjustments, data quality, metrics.


Your AI systems drift. Prompts that worked last month produce mediocre outputs today. Response times creep up. Edge cases multiply. Without regular maintenance, you're burning tokens on subpar results.

This 30-minute weekly review catches degradation before it reaches clients. Use it every Monday morning or Friday afternoon - pick a slot and protect it. You'll review four areas: queue patterns, prompt performance, data quality, and cost metrics.

Pre-Session Setup (5 minutes)

Pull these reports before you start:

  • Last 7 days of API logs from OpenAI/Anthropic dashboard
  • Token usage by endpoint (available in usage reports)
  • Error rate by prompt template (track in your application logs)
  • Average response time by request type (from your monitoring tool)

If you don't have monitoring in place, set up basic logging this week. At minimum, log: timestamp, prompt template ID, token count, response time, and any error codes.
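If you're starting from zero, a minimal logger can be a single function that writes one JSON line per API call. This is a sketch, not a provider API: the field names (`template_id`, `error_code`) and the log path are illustrative, and you time the call yourself around your actual request.

```python
# Minimal per-request logger: one JSON line per API call with the fields
# needed for a weekly review. All names here are illustrative assumptions.
import json
import time
import uuid
from datetime import datetime, timezone

LOG_PATH = "api_requests.log"

def log_request(template_id, token_count, response_time_s, error_code=None):
    """Append one JSON record: timestamp, template ID, tokens, latency, error."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "template_id": template_id,
        "token_count": token_count,
        "response_time_s": round(response_time_s, 3),
        "error_code": error_code,  # None on success
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Usage: time the call yourself, then log it.
start = time.monotonic()
# response = client.chat.completions.create(...)  # your actual API call here
log_request("client_intake_summary", token_count=812,
            response_time_s=time.monotonic() - start)
```

One JSON object per line keeps the log trivially parseable for the analysis steps below, with no database required.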

1. Queue Pattern Analysis (8 minutes)

What to check:

Open your API logs. Sort by volume. Identify your top 5 prompt types by request count.

Compare this week to last week:

  • Did any prompt type jump more than 30% in volume?
  • Did any new prompt category appear in your top 10?
  • Are requests clustering at specific times (indicating batch jobs or user behavior patterns)?
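The week-over-week comparison above is easy to automate once you have per-template request counts. A minimal sketch, assuming you can produce those counts from your logs (the template names and numbers here are illustrative):

```python
# Flag prompt templates whose volume moved more than 30% week-over-week,
# or appeared for the first time. Counts are illustrative assumptions.
from collections import Counter

def flag_volume_shifts(last_week, this_week, threshold=0.30, top_n=5):
    """Return (top templates by volume, templates with outsized volume change)."""
    top = Counter(this_week).most_common(top_n)
    flagged = []
    for template, count in this_week.items():
        prev = last_week.get(template, 0)
        if prev == 0:
            flagged.append((template, "new this week"))
        elif abs(count - prev) / prev > threshold:
            flagged.append((template, f"{(count - prev) / prev:+.0%}"))
    return top, flagged

last = {"intake_summary": 45, "email_draft": 60, "report": 20}
this = {"intake_summary": 127, "email_draft": 63, "report": 19, "faq_answer": 8}
top, flagged = flag_volume_shifts(last, this)
# intake_summary jumped +182% and faq_answer is new -- both get flagged.
```

Anything in `flagged` goes straight onto the action checklist below.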

Red flags:

  • Sudden spike in a single prompt type (indicates a user found a workaround or a process broke)
  • Requests timing out consistently between 2-4 AM (batch job needs optimization)
  • New prompt patterns you didn't design (users are improvising - capture and standardize these)

Action checklist:

  • [ ] Document the top 5 prompt types and their weekly volume
  • [ ] Flag any prompt type with >30% volume change
  • [ ] Identify 1-2 improvised prompts to convert into official templates
  • [ ] Note any time-of-day clustering for batch job review

Example finding: "Client intake summary" prompts jumped from 45/week to 127/week. Users are bypassing the manual intake form. Convert this into an official workflow and add structured output formatting.

2. Prompt Performance Review (10 minutes)

What to check:

Pull 5 random outputs from your highest-volume prompt template. Read them completely. Ask:

  • Does the output match the intended format?
  • Are there repeated phrases or filler content?
  • Does it hallucinate facts or make unsupported claims?
  • Would you send this to a client without editing?

Now check your lowest-performing prompt (highest error rate or longest response time). Run it 3 times with the same input. Compare outputs.
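"Compare outputs" can be rough-quantified. One lightweight approach (an assumption, not a rigorous metric) is average pairwise text similarity across the repeated runs; `call_model` below is a placeholder for your actual API call:

```python
# Rough consistency check: run the same prompt N times and score how similar
# the outputs are. Low scores suggest the prompt is too open-ended.
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(outputs):
    """Average pairwise similarity of the outputs, from 0 (unrelated) to 1 (identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# outputs = [call_model(prompt, input_text) for _ in range(3)]  # your API call
# if consistency_score(outputs) < 0.6:  # threshold is a judgment call
#     print("Outputs vary widely -- tighten the prompt's format constraints.")
```

Character-level similarity is crude (two paraphrases score low), so treat it as a tripwire that tells you which prompts deserve a manual read, not as a quality score.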

Red flags:

  • Outputs vary wildly between runs (prompt is too open-ended)
  • Model consistently ignores specific instructions (instruction is buried or unclear)
  • Response includes "As an AI language model" or similar hedging (system prompt needs work)
  • Output exceeds needed length by 2x (you're wasting tokens)

Action checklist:

  • [ ] Test your top prompt template with 5 random inputs
  • [ ] Identify 1 specific improvement for your worst-performing prompt
  • [ ] Check if any prompts can use a cheaper model (GPT-4 → GPT-3.5 or Claude Opus → Sonnet)
  • [ ] Update 1 prompt template based on findings

Specific fixes:

Replace vague instructions:

  • Bad: "Summarize this document professionally"
  • Good: "Create a 3-paragraph summary. P1: Key findings. P2: Methodology. P3: Recommendations. Use bullet points for lists. Max 200 words."

Add output constraints:

  • Bad: "Draft a client email"
  • Good: "Draft a client email. Subject line: [topic]. Body: 4 sentences max. Tone: direct, no apologies. End with a specific next step and deadline."

3. Data Quality Spot Check (5 minutes)

What to check:

If you're using RAG (retrieval-augmented generation) or fine-tuning:

Open your vector database or training dataset. Pull 10 random documents. Verify:

  • Are dates current (nothing older than 6 months unless historical reference)?
  • Do documents match your current service offerings?
  • Are there obvious errors (formatting issues, truncated text, OCR mistakes)?

Red flags:

  • Documents reference discontinued services
  • Pricing information is outdated
  • Legal disclaimers are from previous policy versions
  • Source documents have poor OCR quality (common with scanned PDFs)

Action checklist:

  • [ ] Review 10 random documents from your knowledge base
  • [ ] Flag any documents older than 6 months for review
  • [ ] Remove or update 1 outdated document
  • [ ] Add 1 new document if you launched a service or changed a process this week

Quick win: Set a document expiration policy. Tag every document with a "review by" date. Auto-flag anything past that date during your weekly session.

4. Cost and Performance Metrics (7 minutes)

What to check:

Open your billing dashboard. Compare this week to last week:

  • Total tokens used
  • Cost per request type
  • Average tokens per request
  • Percentage of requests using GPT-4 vs GPT-3.5 (or Claude Opus vs Sonnet)

Calculate your cost per output type:

  • Client deliverable: $X per document
  • Internal summary: $X per summary
  • Email draft: $X per email

Red flags:

  • Token usage increased but request volume stayed flat (prompts are getting bloated)
  • More than 40% of requests use your most expensive model (you're over-provisioning)
  • Cost per request type varies by more than 50% week-to-week (inconsistent inputs or prompt drift)
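Both the cost-per-output calculation and the 50% drift flag reduce to simple arithmetic. A sketch with illustrative dollar amounts:

```python
# Cost per output and a week-over-week drift flag. The dollar figures
# below are illustrative, not real provider pricing.
def cost_per_output(total_cost, request_count):
    """Average cost of one output of a given type."""
    return total_cost / request_count if request_count else 0.0

def drifted(last_cost, this_cost, threshold=0.50):
    """True if cost per request moved more than `threshold` week-over-week."""
    if last_cost == 0:
        return this_cost > 0
    return abs(this_cost - last_cost) / last_cost > threshold

last = cost_per_output(42.00, 60)   # $0.70 per deliverable last week
this = cost_per_output(98.00, 65)   # ~$1.51 this week
# drifted(last, this) is True: a >50% jump worth investigating for prompt bloat.
```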

Action checklist:

  • [ ] Calculate cost per output for your top 3 use cases
  • [ ] Identify 1 prompt that can move to a cheaper model
  • [ ] Set a token budget alert (most providers offer this)
  • [ ] Document your baseline metrics for next week's comparison

Model selection guide:

Use GPT-4/Claude Opus for:

  • Client-facing deliverables
  • Complex analysis requiring multi-step reasoning
  • Outputs where errors have high cost

Use GPT-3.5/Claude Sonnet for:

  • Internal summaries
  • Email drafts
  • Data extraction from structured documents
  • Anything with clear right/wrong answers

Cost optimization example: Moving internal meeting summaries from GPT-4 to GPT-3.5 Turbo saved one firm $340/month with no quality loss. Test this with 20 outputs before switching completely.
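Before running the 20-output test, a back-of-envelope estimate tells you whether a model switch is worth the effort. The per-million-token prices below are illustrative assumptions, not current provider pricing; check your provider's pricing page for real numbers:

```python
# Estimate monthly savings from moving a workload to a cheaper model.
# Prices per 1M tokens are illustrative assumptions only.
def monthly_cost(requests_per_month, tokens_per_request, price_per_million_tokens):
    return requests_per_month * tokens_per_request * price_per_million_tokens / 1_000_000

expensive = monthly_cost(1_200, 3_000, 30.00)  # e.g. a premium model
cheap     = monthly_cost(1_200, 3_000, 1.50)   # e.g. a smaller model
savings   = expensive - cheap
# expensive = $108.00/month, cheap = $5.40/month for this hypothetical workload.
```

If the estimated savings are trivial, skip the migration and spend the session elsewhere; if they're material, run the side-by-side quality test before switching.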

Weekly Action Summary Template

Copy this into your notes each week:

Week of: [DATE]

TOP FINDING:
[One sentence describing the biggest issue or opportunity]

CHANGES MADE:
1. [Specific prompt update, model change, or process fix]
2. [Second change]
3. [Third change]

METRICS:
- Total requests: [number] (vs [number] last week)
- Total cost: $[amount] (vs $[amount] last week)
- Avg response time: [seconds]
- Error rate: [percentage]

NEXT WEEK FOCUS:
[One specific thing to investigate or improve]

Bottom Line

This checklist prevents the slow degradation that kills AI ROI. Thirty minutes per week catches prompt drift, cost creep, and data staleness before they compound.

The firms that succeed with AI treat it like production software - they monitor, tune, and improve continuously. The firms that fail treat it like magic - they deploy once and wonder why results deteriorate.

Schedule your first session now. Put it on your calendar as a recurring meeting with yourself. Protect that time. Your AI systems will only be as good as the attention you give them.


Reviewed by Revenue Institute

This guide is actively maintained and reviewed by the implementation experts at Revenue Institute. As the creators of The AI Workforce Playbook, we test and deploy these exact frameworks for professional services firms scaling without new headcount.


Need help turning this guide into reality? Revenue Institute builds and implements the AI workforce for professional services firms.

RevenueInstitute.com