How to Run an AI Pilot That Actually Scales

I have watched dozens of AI pilots. Most of them worked. Almost none of them scaled. That gap — between a demo that impresses a steering committee and a system that runs in production every day — is where budgets go to die. We even have a name for it: pilot purgatory.

After running AI and ERP change programs at Novartis, I stopped blaming the models. The pilots that failed were not technically worse. They were designed wrong from day one. A pilot is not a science experiment. It is a business case you build in miniature, under real conditions, so that saying yes to scale becomes the obvious decision.

Here is how I set them up now.

1. Pick a use case with a boring, expensive problem

The temptation is to pilot the most exciting thing. Resist it. The best first use case is a repetitive, high-volume task that people already hate and that costs real money.

One of ours: classifying and routing incoming supplier invoices. Not glamorous. But 40,000 invoices a month, each touched by a human for 90 seconds, adds up to about 1,000 hours of manual work every month, and that is a number a CFO understands. If AI removes even 60% of that manual touch, the value is obvious before you write a line of code.
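The arithmetic behind that claim fits in a few lines. Here is a minimal sketch in Python, using the volume, handling time, and automation share quoted above. The function names are mine, and the hourly cost is deliberately left as an input, because it depends on your own organization.

```python
# Back-of-envelope sizing of the manual invoice workload.
# Volume, handling time and automation share are the figures quoted in the text;
# the function names are illustrative, not part of any real tool.

INVOICES_PER_MONTH = 40_000
SECONDS_PER_INVOICE = 90
AUTOMATED_SHARE = 0.60


def monthly_manual_hours(invoices: int, seconds_each: float) -> float:
    """Hours of human handling per month at the current baseline."""
    return invoices * seconds_each / 3600


def monthly_hours_saved(invoices: int, seconds_each: float, automated_share: float) -> float:
    """Hours no longer spent if `automated_share` of the manual touch is removed."""
    return monthly_manual_hours(invoices, seconds_each) * automated_share


def monthly_saving(hours_saved: float, hourly_cost: float) -> float:
    """Convert saved hours into money. The hourly cost is an input you must supply."""
    return hours_saved * hourly_cost


if __name__ == "__main__":
    baseline = monthly_manual_hours(INVOICES_PER_MONTH, SECONDS_PER_INVOICE)
    saved = monthly_hours_saved(INVOICES_PER_MONTH, SECONDS_PER_INVOICE, AUTOMATED_SHARE)
    print(f"Manual effort today: {baseline:,.0f} hours per month")
    print(f"Effort removed at 60%: {saved:,.0f} hours per month")
```

With the figures above, this works out to roughly 1,000 hours of handling per month today and roughly 600 hours removed. Plug in your own fully loaded hourly cost to turn that into a money figure.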

Ask two questions before committing:

Is the pain measurable in money or hours today? If nobody can quote the current cost, you cannot prove savings later.
Does someone senior already lose sleep over it? A sponsor with skin in the game is worth more than any accuracy metric.

2. Use real data, not the clean demo set

Vendor demos run on tidy, curated data. Production does not. The fastest way to kill a pilot's credibility is to have it collapse the first time it meets a scanned PDF from 2014.

I insist on piloting against a live slice of messy, current production data — with all its typos, missing fields, and edge cases. Yes, accuracy drops. That is the point. I would rather see 82% on real data than 97% on a fantasy. The real number is the one you will have to defend when you scale.

3. Define success before you start — in production terms

This is where most pilots quietly fail. "It works well" is not a success criterion. Write down the exact bar for a go/no-go decision, agreed with the sponsor, before the pilot begins.

A pilot without a pre-agreed success threshold is not a pilot. It is a very expensive way to generate opinions.

For the invoice pilot, the criteria were concrete:

Straight-through processing rate above 65% with zero human touch.
Error rate below the current human baseline of 1.8%.
Cost per invoice under 0.12 CHF, all-in, including model and review time.

Notice these are production numbers, not lab numbers. They already assume the messy data, the exception handling, and the humans who stay in the loop.
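One way to keep those criteria honest is to write them down as an executable check before the pilot starts. The sketch below is one possible shape for that, not the tooling we actually used. The three thresholds are the ones listed above; the class name, field names, and the sample result are hypothetical.

```python
from dataclasses import dataclass

# Thresholds agreed with the sponsor before the pilot (taken from the criteria above).
MIN_STP_RATE = 0.65          # straight-through processing rate must be above this
MAX_ERROR_RATE = 0.018       # must be below the human baseline of 1.8%
MAX_COST_PER_INVOICE = 0.12  # CHF, all-in, including model and review time


@dataclass(frozen=True)
class PilotResult:
    """Measured outcome of the pilot on real production data (hypothetical structure)."""
    stp_rate: float              # share of invoices processed with zero human touch
    error_rate: float            # share of invoices processed incorrectly
    cost_per_invoice_chf: float  # all-in cost per invoice


def evaluate(result: PilotResult) -> dict[str, bool]:
    """Return a pass/fail flag per criterion so a failure is easy to explain."""
    return {
        "stp_rate": result.stp_rate > MIN_STP_RATE,
        "error_rate": result.error_rate < MAX_ERROR_RATE,
        "cost_per_invoice": result.cost_per_invoice_chf < MAX_COST_PER_INVOICE,
    }


def go_no_go(result: PilotResult) -> str:
    """The decision is 'go' only if every pre-agreed criterion is met."""
    return "go" if all(evaluate(result).values()) else "no-go"


if __name__ == "__main__":
    # Made-up numbers, purely to show the check in use.
    sample = PilotResult(stp_rate=0.70, error_rate=0.012, cost_per_invoice_chf=0.10)
    print(evaluate(sample))
    print(go_no_go(sample))
```

The design choice that matters is that the thresholds are constants fixed up front, and the decision is a strict all-or-nothing check. A result that lands exactly on a threshold does not pass, which matches the "above" and "below" wording of the criteria.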

4. Design the path to scale into the pilot itself

The mistake I made early on: treating the pilot as a throwaway prototype, then discovering that nothing about it could survive contact with IT, security, or the actual volume. You end up rebuilding everything, and momentum dies during the six-month wait.

Now the pilot runs on the same rails production would use:

Same data pipeline. If the real feed is a nightly SAP export, the pilot reads that, not a spreadsheet a colleague emailed you.
Security and compliance in the room from week one. Not to approve the final thing — to tell you now what would block it later.
A named owner for the scaled version. If no team wants to own it in production, you are building an orphan.

A pilot designed this way is roughly 70% of the way to production when it ends. A pilot designed as a demo is about 10% of the way there, and the remaining 90% is where projects rot.

5. Time-box brutally

Six to eight weeks. If it cannot show signal in eight weeks, either the use case is wrong or the data is not ready — and both of those are cheaper to learn fast. Long pilots do not produce more certainty. They produce sunk-cost attachment and stale sponsors.

What actually separates the winners

The pilots that scale are not the ones with the fanciest models. They are the ones where, on the final day, the business case writes itself: here is the real accuracy on real data, here is the cost, here is who owns it, and here is the sponsor who already wants it live.

Scaling should feel like the least dramatic decision in the room — a formality, because the pilot already proved everything that mattered. If your go/no-go meeting is a debate, your pilot did not do its job.

Stop building pilots to prove that AI is impressive. Build them to prove that this specific use case, with your data, at your cost, is ready. That is the pilot that escapes purgatory.

Cédric Bignet is an AI & ERP Change Management expert at Novartis and founder of AInspire. He writes about change management, AI adoption and enterprise transformation.
