Measuring Copilot ROI without lying to yourself

Every Copilot business case eventually produces the same slide: “users report saving X hours per week,” multiplied by headcount, multiplied by a loaded hourly rate, equaling a number large enough to justify anything. The CFO nods, the renewal goes through, and nobody in the room believes the number — including the person presenting it.

You can do better, and you should, because the honest math for Copilot is frequently good. It just isn’t the survey math. The survey math is so inflated that when reality eventually intrudes — usually via a usage report at renewal — it takes the credible gains down with it.

Why “hours saved” surveys overcount

Four mechanisms, all of them well-understood and all of them ignored:

Estimation inflation. “How much time does Copilot save you weekly?” invites people to recall their best moment and project it across the week. The vivid memory of one 45-minute save becomes “a few hours a week” on the form. Self-reported time data reliably diverges from observed time data, and almost always in the flattering direction.
Saved minutes aren’t harvested minutes. A meeting recap that saves 12 minutes does not produce 12 minutes of additional output. It produces a slightly earlier coffee. Fragmented micro-savings only convert to business value when they aggregate into something redeployable — or when they shorten a cycle time someone is waiting on. Multiplying minute-fragments by hourly rates is fiction arithmetic.
Survivorship. Surveys get answered by people who use the product. The 40% of licenses sitting dormant don’t fill in your form, but they absolutely show up on your invoice. The survey measures your fans; the cost measures your tenant.
The demo-week effect. Sentiment surveys run in the honeymoon period capture novelty enthusiasm. The number you need is from month four.

Keep running sentiment surveys — they’re useful as adoption health data and they catch friction early. Just stop calling the output ROI.

A survey tells you whether people like Copilot. A stopwatch tells you whether it’s working. These are different instruments measuring different things, and only one of them belongs in a business case.

What to measure instead

1. Task-level before/after timing on five workflows

Pick five — not twenty — recurring workflows where Copilot plausibly changes the work, and time them properly: real practitioners, real artifacts, measured before enablement (you did capture a baseline at day zero, per the 90-day playbook?) and again at day 90+, after habits have set.

Good candidates: meeting follow-up production (recap → reviewed action items → sent email), first-draft proposal or SOW from a template, weekly status compilation, inbox-to-triaged each morning, RFP/security-questionnaire response sections.

Rules that keep the data honest: time to finished, reviewed quality — not to first draft, because the review pass is real work and Copilot output always needs one. Use a handful of people per workflow, take medians. Include the failures: the prompt that needed three retries is part of the true cost, and excluding it is the survey error in a lab coat.

Five workflows won’t capture all the value. That’s fine. A defensible floor beats an indefensible ceiling. “We measured 31% time reduction across five workflows that account for roughly six hours per person-week” is a sentence that survives interrogation.
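
The timing rules above reduce to a small calculation. A minimal sketch, assuming you record each timed run in minutes to reviewed quality; the workflow names and numbers are made up for illustration, and the 55-minute run stands in for a session that needed retries:

```python
from statistics import median

def median_reduction(before_minutes, after_minutes):
    """Fractional drop in median time-to-reviewed-quality."""
    before = median(before_minutes)
    after = median(after_minutes)
    return (before - after) / before

# Hypothetical timings in minutes, one entry per timed run.
# Failed or retried runs stay in the "after" list at full duration.
timings = {
    "meeting follow-up": ([40, 45, 50, 38, 42], [25, 30, 55, 28, 27]),
    "weekly status": ([60, 70, 65, 75, 62], [45, 50, 48, 44, 52]),
}

reductions = {name: median_reduction(b, a) for name, (b, a) in timings.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.0%} median reduction")
```

Medians keep the bad run in the data without letting one outlier set the headline number, which is why they fit small samples better than means.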

2. Ticket deflection for agents

Agents are the easy ROI story because they have a natural control: the queue. If your HR-policy agent works, HR’s question volume drops, and ticketing systems already count this. Measure three things. First, monthly tickets in the covered category for the three months before launch and the three months after. Second, agent conversation volume (from your agent analytics) against the ticket decline, so you’re seeing substitution rather than coincidence. Third, answer quality by spot-check: Copilot Studio’s evaluation API lets you automate regression checks via REST, so quality can be a monitored metric rather than an annual panic.

Deflected tickets convert to dollars credibly because the counterfactual cost is known: your help desk already has a cost-per-ticket. This is the cleanest line in the whole business case — which is one more reason your first agent should target a high-volume queue.
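
The deflection arithmetic is short enough to sketch. This assumes you can export monthly ticket counts for the covered category and monthly agent conversation counts; the function name, the figures, and the $12 cost per ticket are all hypothetical:

```python
from statistics import mean

def deflection_summary(tickets_before, tickets_after, agent_conversations, cost_per_ticket):
    """Average monthly ticket decline, its dollar value, and a substitution check."""
    deflected = max(mean(tickets_before) - mean(tickets_after), 0)
    avg_conversations = mean(agent_conversations)
    return {
        "deflected_per_month": deflected,
        "monthly_value": deflected * cost_per_ticket,
        # Substitution is only plausible if the agent handled at least as many
        # conversations as the queue lost; otherwise suspect coincidence.
        "plausible_substitution": avg_conversations >= deflected > 0,
    }

# Hypothetical three-month windows either side of launch.
summary = deflection_summary(
    tickets_before=[420, 400, 410],
    tickets_after=[310, 300, 290],
    agent_conversations=[250, 260, 270],
    cost_per_ticket=12.0,
)
print(summary)
```

The substitution flag is a sanity check, not proof: a seasonal dip in the queue can still pass it, so compare against the same months in the prior year when you have them.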

3. Cycle time on document production

Distinct from task timing: not “how many minutes of effort,” but how many days from request to delivered. Proposal turnaround, contract-draft delivery, report publication. Cycle time is the metric leadership actually feels — a proposal that goes out in two days instead of five changes win rates, not just timesheets — and it’s resistant to the harvest problem, because elapsed time is externally observable. Pull request-to-delivery dates from whatever system tracks the work; even a sampled manual pull of 20 documents per quarter beats no data.
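
A sampled manual pull can be summarized in a few lines. A minimal sketch, assuming each record is a (requested, delivered) date pair and that calendar days are an acceptable unit; the dates are invented:

```python
from datetime import date
from statistics import median

def median_cycle_days(records):
    """Median elapsed calendar days from request to delivery."""
    return median((delivered - requested).days for requested, delivered in records)

# Hypothetical samples of (requested, delivered) dates.
before = [
    (date(2025, 1, 6), date(2025, 1, 11)),
    (date(2025, 1, 8), date(2025, 1, 14)),
    (date(2025, 1, 13), date(2025, 1, 17)),
]
after = [
    (date(2025, 4, 7), date(2025, 4, 9)),
    (date(2025, 4, 10), date(2025, 4, 13)),
    (date(2025, 4, 14), date(2025, 4, 16)),
]

before_median = median_cycle_days(before)
after_median = median_cycle_days(after)
change = (before_median - after_median) / before_median
print(f"median cycle time: {before_median} -> {after_median} days ({change:.0%} shorter)")
```

If weekends distort your numbers, swap in business days, but keep the same unit for both periods.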

The instrumentation you already have

Copilot analytics in the Microsoft 365 admin center is your usage backbone: active users by app, feature-level adoption, trends. It will not tell you value — it tells you who is in a position to generate value, which is exactly the denominator the survey approach quietly drops. Pull it monthly; treat sustained dormancy as a license-reassignment trigger, not a footnote.
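
One way to turn “sustained dormancy” into a trigger, sketched under the assumption that you keep a per-user record of monthly activity built from your usage exports. The data shape, the names, and the three-month window are illustrative, not anything the admin center produces in this form:

```python
def dormant_users(monthly_activity, months=3):
    """Users with no Copilot activity in each of the most recent `months` months.

    `monthly_activity` maps a user to a list of booleans, oldest month first.
    Users with less history than the window are skipped rather than flagged.
    """
    return sorted(
        user
        for user, history in monthly_activity.items()
        if len(history) >= months and not any(history[-months:])
    )

# Hypothetical activity history per licensed user.
activity = {
    "ana": [True, True, False],
    "ben": [True, False, False, False],
    "cy": [False, False, False],
    "dee": [False, False],  # licensed too recently to judge
}

reassign = dormant_users(activity)
print(reassign)
```

Skipping users with short histories avoids reclaiming a license from someone who was enabled last month and hasn’t formed habits yet.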

The agent usage estimator in Copilot Studio does the forecasting side: projected Copilot Credit consumption for an agent before you ship it. Run it pre-launch, then compare actuals to estimate monthly. An agent whose credit burn runs far above estimate with no matching ticket deflection is an agent that’s being chatted with rather than used — a real pattern, and an expensive one at tenant-grounded rates.
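
The monthly estimate-versus-actual check can be sketched as a simple flag. The record fields and both thresholds are assumptions chosen for illustration; neither the estimator nor billing emits data in this shape:

```python
from dataclasses import dataclass

@dataclass
class AgentMonth:
    name: str
    estimated_credits: float  # from the pre-launch estimate
    actual_credits: float     # from billing
    tickets_deflected: float  # from the ticket-deflection measurement

def needs_review(agent, overrun_threshold=1.5, min_deflection=1):
    """Flag agents burning well above estimate with no matching deflection."""
    if agent.estimated_credits <= 0:
        raise ValueError("estimated_credits must be positive")
    overrun = agent.actual_credits / agent.estimated_credits
    return overrun > overrun_threshold and agent.tickets_deflected < min_deflection

# Hypothetical month of data for two agents.
agents = [
    AgentMonth("hr-policy", 10_000, 11_000, 110),
    AgentMonth("it-faq", 8_000, 20_000, 0),
]

flagged = [a.name for a in agents if needs_review(a)]
print(flagged)
```

An agent that overruns its estimate while also deflecting tickets is a forecasting problem, not a value problem, so it is deliberately left unflagged here.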

The honest framework: cost per active user vs measured savings

Assemble it like this:

Cost side. License spend + credit spend (packs plus pay-as-you-go) + program cost (champion time, training, agent build time — yes, count it). Divide by monthly active users, not licensed users. This is the step most business cases dodge, because it’s the step that hurts: at 55% active, your effective cost per user-who-could-be-benefiting is nearly double the sticker price.
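
The cost side as arithmetic, in a minimal sketch. The seat count, the $30 per-seat price, and the credit and program figures are placeholder assumptions; substitute your own invoice numbers:

```python
def cost_per_active_user(license_spend, credit_spend, program_cost, active_users):
    """Full monthly cost stack divided by monthly active users."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return (license_spend + credit_spend + program_cost) / active_users

# Hypothetical tenant: 1,000 licenses at an assumed $30 each, 55% active.
licenses = 1_000
price_per_license = 30.0
active_users = 550

licenses_only = cost_per_active_user(licenses * price_per_license, 0, 0, active_users)
all_in = cost_per_active_user(licenses * price_per_license, 2_000, 4_000, active_users)

print(f"licenses only: ${licenses_only:.2f} per active user "
      f"({licenses_only / price_per_license:.2f}x sticker)")
print(f"all-in:        ${all_in:.2f} per active user")
```

At 55% active the license-only multiple is 1/0.55, about 1.8x sticker, before credits or program cost are added.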

Value side. Only measured items: (timed workflow savings × workflow frequency × the people demonstrably doing that workflow with Copilot) + (deflected tickets × cost per ticket) + cycle-time gains, which you may choose to report in days rather than dollars — “proposal turnaround down 40%” needs no hourly-rate fiction to land.
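
The value side can be written down the same way. This sketch assumes you convert timed savings to dollars with a loaded hourly rate, which is itself a judgment call; every name and number below is hypothetical, and cycle-time gains are left out because they are reported in days:

```python
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    minutes_saved_per_run: float   # median before minus median after
    runs_per_person_month: float   # workflow frequency
    people_using_copilot: int      # demonstrably doing this workflow with Copilot

def monthly_value(workflows, loaded_hourly_rate, deflected_tickets, cost_per_ticket):
    """Measured monthly value: timed workflow savings plus ticket deflection."""
    minutes = sum(
        w.minutes_saved_per_run * w.runs_per_person_month * w.people_using_copilot
        for w in workflows
    )
    return minutes / 60 * loaded_hourly_rate + deflected_tickets * cost_per_ticket

# Hypothetical measured inputs.
workflows = [
    Workflow("meeting follow-up", 14, 20, 120),
    Workflow("weekly status", 17, 4, 80),
]

value = monthly_value(workflows, loaded_hourly_rate=60.0,
                      deflected_tickets=110, cost_per_ticket=12.0)
print(f"measured monthly value: ${value:,.0f}")
```

Counting only people observed doing the workflow with Copilot, rather than everyone licensed, is what keeps this number a floor.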

Then publish the comparison, including when it’s unflattering. A typical honest mid-rollout readout looks like: cost per active user ~$54/month all-in; measured, defensible savings ~$95–140/month per active user across the five workflows; agent deflection adding a separate $3k/month line; 38% of licenses dormant and scheduled for reassignment. That’s a good result: measured savings of roughly 1.8x cost at the low end of the range and about 2.6x at the high end. It’s believable precisely because it admits the dormancy number and excludes everything unmeasured.

Metric	Instrument	Trap it avoids
Workflow time, before/after	Stopwatch, 5 workflows, medians, to reviewed-quality	Estimation inflation
Ticket deflection	Ticketing system + agent analytics + automated evals	Attribution hand-waving
Document cycle time	Request-to-delivery dates	The “saved minutes” harvest problem
Cost per active user	Full cost stack ÷ active users (admin center analytics)	Survivorship: dormant licenses hiding in the denominator
Credit forecast vs actual	Usage estimator + billing	Agent costs discovered at invoice time

The closing argument

The dishonest version of Copilot ROI is a big number nobody trusts. The honest version is a smaller number that survives every question in the room — and a management system as a side effect: dormant licenses get recycled, weak agents get fixed or killed, and the five measured workflows become the template for choosing the next five. If the honest math doesn’t clear the bar at your active-user rate, that’s not a measurement failure. That’s the measurement working — telling you to fix adoption before you fix the slide.