Token Economics for People Who Skipped the Engineering Track
What you wish someone had told you before you ran 80,000 records through GPT-4.
↳ paired with: The Bill Arrives on LinkedIn
Field Note — Originally posted on LinkedIn 20 May 2026
Spent the day shadowing a VP of FP&A at a Tier 1 manufacturer.
He met me in the lobby at 7:52am holding a closed MacBook, a Liquid Death, and a quarter-zip with the company logo embroidered over the heart. Said he had been up since five vibing.
"You're going to love what I built this weekend."
His standing desk was already raised when we got upstairs. On the screen was a Replit app called TicketSenseAI with a logo that looked like Clippy if Clippy had been laid off and joined CrossFit.
"It reads every open SAP ticket in the queue and tells me what's actually broken. I shipped it last night. Demoing to the COO at eleven."
I asked how many tickets were in the queue.
"Eighty-something thousand. We've been backlogged since Q3."
He hit Run.
By 9am it had processed about three hundred. He kept the OpenAI usage dashboard open in a second tab. The number climbed. He didn't seem to notice.
10am, he stepped out to a quick sync. When he came back he refreshed the dashboard, paused, then opened a new tab and asked Claude what a TPM rate limit was.
The COO demo was, by his account, a home run. He showed her three handpicked summaries. She nodded a lot. He told her it would replace two contractors. She asked if it could also do Salesforce.
Lunch was a desk salad while the app ran. He was eating with one hand and vibe coding with the other, adding a feature he called Executive Mode that would email the COO a daily digest using the most expensive model available. He did not turn off the cheaper pass already running.
I asked if he had written code before this year.
He didn't look up.
"Econ degree from Wake. My old boss used to say I had good instincts for systems. He retired in February. He used to come to my kids' soccer games."
By 3pm the app had processed about forty thousand tickets and the dashboard was loading slowly. He pushed another change.
At 4:47pm an email arrived from his AmEx concierge confirming a charge of $11,400 from a vendor he didn't recognize. He stared at it. Then a second one. $4,200. Then a Teams DM from a procurement analyst three levels down.
Hey, are you running the API account? Finance flagged it.
He turned to me.
"Don't put this in your write-up. The COO is going to love this thing. We just need to figure out the bill stuff."
By 5:30 he had asked Claude to draft a memo titled "AI Innovation Cost Variance, Q4." First sentence: As is typical with breakthrough technologies.
He pasted it into Outlook. Glanced at it once. Didn't change a word.
Then he leaned back.
"We've done good work today."
I asked if he was going to keep the app running overnight.
He thought about it.
"Yeah. Let's let it cook."
Learned a lot today.
Sent from my iPhone
What I couldn't really say
There is a moment in every vibe coded app where you decide whether to think about cost. It is usually around 11pm. You are on the third Liquid Death. The thing works. You hit Run.
Most of the time, this is fine. The job is small. The bill is twelve cents. You ship.
Sometimes the job is eighty thousand records and the model you used is the most expensive one your provider offers. By morning the bill is in the thousands and you are drafting a memo about how innovation cost variance is, as is typical with breakthrough technologies, hard to predict in the first quarter.
What I couldn't really say in the conference room is that the bill was a choice. It was the choice not to spend ninety seconds on the back-of-envelope estimate that anyone who has shipped a metered service for ten years does in their head before they hit Run. The choice not to use the cheap model for the easy part of the job. The choice not to cache the prompt prefix that was eighty percent of the tokens. The choice not to use the batch endpoint that costs half as much and ships overnight while you sleep.
None of these choices are technical. They are habits. The reason people who came up shipping production systems have them is they paid for them the first time, the way you just did.
Here is what I would have told you in private, if I could have.
1. Before you run anything
The back-of-envelope estimate takes ninety seconds.
- Pick the model. Look up the price per million tokens for input and output.
- Estimate the average prompt size and response size in tokens. Rough rule: one word is about 1.3 tokens.
- Multiply: (input_tokens × in_price) + (output_tokens × out_price). That is the cost per call.
- Multiply cost per call by the number of calls you plan to make.
If the result is more than $50, you are no longer prototyping. Treat this like a production job. If it is more than $500, get a second opinion before you hit Run.
Also: always pilot on 10 records first. Compare the output to your eye. Then 100. Then 1000. Then the full set. The number of jobs that failed at 80,000 because the prompt was wrong on a structural edge case that did not appear in the first three records is too many to count.
2. Model selection patterns
You do not need the premium model for most jobs.
- Classification, simple summarisation, extraction, language detection, basic translation. Use the cheap tier (Haiku, GPT-4o-mini, Gemini Flash). Ten to thirty times cheaper than the premium tier. The quality drop is usually invisible.
- Reasoning, nuanced summarisation, code generation, anything where you would be embarrassed by a wrong answer. Use the mid-tier (Sonnet, GPT-4o). This is the right default for most enterprise work.
- Multi-step reasoning, long-context analysis, agentic flows, the things you actually want to show off. Use the premium tier (Opus, GPT-4.1). Reserve it for the calls where the cost is worth the lift.
For repeated lookups where the answer should be deterministic and short, use embeddings instead of generation. Embeddings are cheap. Vector search is fast. If you are using a language model to ask "is this related to that," you are paying ten thousand times more than you need to.
3. Caching patterns
Three caches matter.
- Prompt prefix caching. If your prompt has a long static section (system prompt, instructions, examples, ontology), the provider can cache it on their end and charge you much less for that portion on subsequent calls. Anthropic offers this with the
cache_controlflag. OpenAI auto-applies it on prompts over a threshold. Use it. - Response caching. If the same input is likely to be requested twice in a 24-hour window, cache the response in Redis or even on disk. The cost saved is the entire call.
- Semantic dedup. If you are processing user-submitted text, deduplicate semantically before you call the model. Two questions that mean the same thing should not result in two billed calls.
4. Batching patterns
If your job is not time-sensitive, use the batch endpoint.
- Anthropic and OpenAI both offer Batch APIs at roughly fifty percent off the standard rate.
- Trade-off: you submit a batch and get results within 24 hours.
- For overnight jobs (summarising the ticket queue, classifying support tickets, generating embeddings for a corpus), this is free money.
Always batch by chunks of 1000 to 5000 records, not the full set in one go. If the batch fails midway, you have not lost everything.
5. Cost guardrails
What separates a hobby project from a production system is the guardrails.
- Service account, not your personal key. Your personal key has your personal credit card. Set up a service account with a hard monthly cap.
- Hard spend limit. Most providers let you set a cap that pauses the account when hit. Set it for three times what you think the month will cost. Not thirty times.
- Per-request cost logging. Log the cost of every call in your application logs, tagged by feature or job. If you cannot answer "what does this feature cost per user per month," you have not instrumented it.
- Alerts. Set up an alert when daily spend exceeds the previous 7-day rolling average by fifty percent. Surprise spend is almost always a bug or a runaway loop.
The artifacts
Two formats. Same conclusions.
For your coding agent
Paste this into Claude Code, Cursor, or any reasoning-capable coding agent pointed at your repo. It will produce a structured report and propose concrete diffs.
Paste into your coding agent. Updated 20 May 2026.
You are reviewing my codebase for LLM cost and efficiency.
Walk the repo and locate every place that calls an LLM provider (Anthropic, OpenAI, Google, Bedrock, Azure OpenAI, Cohere, custom clients, abstractions). For each call site, produce a report with:
1. CHARACTERISATION
What kind of task is this call doing? Pick one: classification, extraction, simple summarisation, reasoning, code generation, nuanced summarisation, multi-step agentic, other.
2. MODEL TIER REVIEW
What model is being called? Could a cheaper tier (Haiku, GPT-4o-mini, Gemini Flash) handle the task at acceptable quality? If yes, propose the swap and estimate the cost reduction.
3. PROMPT PREFIX CACHING
Is the prompt prefix (system prompt, instructions, examples) static across calls and longer than ~1024 tokens? If yes, propose adding cache_control (Anthropic) or note whether OpenAI's automatic caching should apply.
4. RESPONSE CACHING
Is the same input likely to be sent twice within a 24-hour window? If yes, propose adding a response cache (Redis, in-memory, or persistent).
5. SEMANTIC DEDUPLICATION
For user-submitted inputs, is there pre-call dedup of semantically similar requests? If no and the use case warrants it, propose embedding-based dedup.
6. BATCHING
Is this call time-sensitive? If no (overnight job, async summary, classification at scale), propose using the Batch API (50% cheaper, 24hr SLA) and outline the refactor.
7. GUARDRAILS
- Is the API key in source code, an env var, or a secret manager? Flag if hardcoded.
- Is cost-per-call logged anywhere?
- Is there a hard spend limit on the account?
- Are there daily spend alerts?
Flag any gaps.
OUTPUT
A structured report with one section per call site, plus a summary table at the top: file path, current monthly cost estimate, optimised monthly cost estimate, complexity to implement.
Then propose the three highest-impact changes as concrete code diffs.
Make conservative assumptions about traffic volume. Ask for actuals if a recommendation depends on them.Works with Claude Code, Cursor, and any reasoning-capable coding agent.
For your monitor
The one-page version. Before you hit Run on anything larger than 1000 records:
- Cost per call estimated and written down.
- Total cost estimated and written down. Acceptable to your boss without a memo.
- Model is the cheapest tier that does the job.
- Prompt prefix is using cache_control or equivalent.
- Job is using the Batch API if time allows.
- Piloted on 10 records, then 100. Output quality verified by a human.
- Service account in use. Personal key is not in this codebase.
- Hard spend limit set on the account.
- Per-request cost logged in application logs.
- Daily spend alert configured.
Print this. Put it next to your monitor. Or do not. We've done good work today either way.
Get the next field note below.
Pre-flight checklist
Token Economics for People Who Skipped the Engineering Track
Get the next one.
Roughly weekly. Field notes only. Unsubscribe anytime.