Tokenomics

Leaders measuring productivity in tokens are mostly right. Most token measures used today are garbage. Most people don’t know what they’re talking about when they talk about tokens as a measure of productivity.


Let’s talk about tokens. I have a few things I want to say about the topic, and they’re all slightly controversial.

  • Leaders measuring productivity in tokens are mostly right
  • Most token measures used today are garbage
  • Most people don’t know what they’re talking about when they talk about tokens as a measure of productivity

Let’s start with measuring productivity in tokens.

More tokens are just better

“More tokens are just better” is a very controversial statement. Leaders at Big Tech are espousing perf reviews based on tokens, building leaderboards, and getting a bunch of flak. The engineers actually doing the work are clear that’s not how you measure productivity.

I agree. It’s definitely not as simple as more tokens = more productivity. And leaders doing perf reviews and leaderboards is the wrong way to go about it (I’ll talk more about this later). But it IS directionally accurate, and in this AI adoption phase, no tokens = inadequate productivity is almost certainly true.

In a world where AI can significantly leverage not only a software engineer’s productivity but literally any knowledge worker’s productivity, you’ll find no sympathy from me if you argue you can be just as productive without AI. Sorry.

Does it translate linearly? Roughly, I think it does, though not necessarily. It also requires a lot of competence and the proper harnesses and guardrails to work well in this linear capacity. Yes, a dumbass can still burn a ton of tokens and just build crap with no value, and a talented engineer and AI operator can significantly outproduce them with far fewer tokens — that’s true.

But when I say “far fewer tokens” for that 10x engineer, that’s still, like… a lot of tokens. Like hitting the weekly usage limits on the 20x Claude Max account amount of tokens.

Besides, and here I’m speculating, I think the people who are really good with AI now using “fewer” tokens will, in the future, build on their current skills to use MORE tokens (not fewer), and they’ll be even more productive than they are today.

In short, I don’t think we’re getting close to the ceiling of token usage for top performers — not even close. As models and harnesses improve in capabilities, even if they improve only marginally (although I personally believe they’ll improve significantly), they’ll become more autonomous, and competent AI operators will be even more leveraged than they are now.

Now, the question of how to compare the dumbass who wastes bucketloads of tokens with the 10x engineer who uses fewer tokens and gets way more done is an important one to tackle. We’ll need to get into engineering productivity, which is an interesting but still mostly uncracked nut.

But before we do that, we first have to understand how token measures work and why most measures out there are garbage.

How many tokens did you use?

“How many tokens did you use?” really is kind of a senseless question. There are input tokens and output tokens (I’ll cover cache writes and reads shortly, hold on).

Measuring tokens as a single value is like measuring someone’s fitness in calories without making a distinction between calories they ate and calories they spent exercising. The BJJ practitioner after practice “used” as many calories as the guy who just had a banana split dessert!

Now, the argument could be “well, but those output tokens are now the input tokens of the next agentic loop.” Yep! So is the banana split!!

A token is just a unit. The four different actions (input, write to cache, read from cache, and output) are the actual work being done, measured in the unit of tokens.

Let’s say the 3 volumes of Lord of the Rings have roughly 650,000 tokens.

Here are two scenarios where I “use” 650k tokens:

  1. I input the whole Lord of the Rings corpus, and get back “meh” from the AI
  2. I input “meh” and get back the whole Lord of the Rings!

Do they sound anywhere close to comparable to you in AI usage? They don’t to me.
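The asymmetry is easy to see in dollars. Here’s a tiny sketch with hypothetical per-token prices (the numbers are made up, but they keep the common ratio where output tokens cost several times more than input tokens):

```python
# Illustrative only: prices are assumptions, roughly shaped like
# frontier-model pricing where output costs ~5x input.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token (assumed)
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (assumed)

def cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request, ignoring caching."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

lotr = 650_000  # rough token count for all three volumes
meh = 1

scenario_1 = cost(input_tokens=lotr, output_tokens=meh)  # LotR in, "meh" out
scenario_2 = cost(input_tokens=meh, output_tokens=lotr)  # "meh" in, LotR out

print(f"Scenario 1: ${scenario_1:.2f}")  # ~$1.95, almost all input
print(f"Scenario 2: ${scenario_2:.2f}")  # ~$9.75, almost all actual work
```

Same 650k “tokens used,” wildly different amounts of work done by the AI.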

Here’s my thesis: the “work” done by AI is measured in Output Tokens. Whether those are code you’ll push, input for your next message, reasoning / thinking to ensure better results, an intermediate output you’ll read and manually iterate on, or wasted garbage that’s either of no use or is clogging your context window from now on, that’s what the AI is ACTUALLY DOING (measured in tokens).

Now, nobody’s gonna get Lord of the Rings out of a “meh” input, even though I’ve seen some cool examples of one-shot apps being built w/ GPT-5.4. Whether it comes from you or from AI, input is the fuel for the AI’s output, and if it was AI that created it, those tokens were presumably counted as output tokens somewhere else.

So let’s talk about the other types of actions AI does with tokens.

Input, cache-reads, and cache-writes

Before I talk about cache-writes and cache-reads, I need to explain a basic thing about how AI works with LLMs today, because if you don’t know this detail, caches won’t make sense:

Every time we ask AI something, we send back the whole content of our conversation. They have no memory.

So if you and I were texting, because you have memory, the messages sent from my phone to yours would look something like:

  • “Hi there reader” -> “Hi, Dui”
  • “How’s it going?” -> “Well, you?”

Because AI doesn’t have memory, the message sent looks more like this:

  • “Hi there Claude” -> “Hi, Dui”
  • “Hi there Claude|Hi, Dui|How’s it going?” -> “Well, you?”

My original message and Claude’s original reply are part of my 2nd message, prepended to my follow-up question. Otherwise, Claude wouldn’t know what we talked about before.
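The replay mechanic above can be sketched in a few lines. This is not a real API client (`fake_model` and `send` are made-up names); it just shows how a “stateful” chat is really the client resending the whole transcript every turn:

```python
# Minimal sketch of a stateless chat: the model only "knows" what we
# resend it on every single call. Names here are illustrative.
def fake_model(messages: list[str]) -> str:
    """Stand-in for an LLM call; it sees only `messages`."""
    return f"(reply to {len(messages)} messages)"

history: list[str] = []

def send(user_message: str) -> str:
    history.append(user_message)
    reply = fake_model(history)  # the WHOLE conversation goes over the wire
    history.append(reply)
    return reply

send("Hi there Claude")   # the model sees 1 message
send("How's it going?")   # the model sees 3: both of mine plus its own reply
print(len(history))       # 4, and turn 3 would resend all of them
```

Every turn, the payload grows by the previous output plus the new message. That growth is exactly what caching exists to make cheap.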

This is critical because that’s the main reason for caches:

  • Input tokens are the tokens loaded into the AI so it can generate the output tokens. Fresh (uncached) input tokens are a small share of usage today, because so much input is repeated, like above.
  • Cache-writes are inputs that get stored so they can be used again. Typically they’re available for 5 minutes, so if I send the same input within 5 minutes, it’s read from cache.
  • Cache-reads are inputs that are read from storage. Reading from cache costs 10% of the price of a fresh input because it uses way fewer resources.

In the Claude conversation example, when I sent my second message:

  • Hi there Claude would be read from cache (cache-read)
  • Hi, Dui|How’s it going? would be written to cache (cache-write)
  • Well, you? would be output

If my follow-up is sent within 5 minutes, then Hi there Claude|Hi, Dui|How’s it going? would be read from cache, and so on.
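Here’s the bookkeeping for that example as code. This is a simplified sketch under the rules described above (the already-sent prefix is read from cache, everything new in the input is written to it); real providers’ cache rules are more involved, and the token counts per message are made up:

```python
# Simplified prefix-cache accounting for one request: the cached prefix
# is read, the rest of the input is written for next time.
def turn_accounting(cached_prefix: int, full_input: int, output: int) -> dict:
    return {
        "cache_read": cached_prefix,
        "cache_write": full_input - cached_prefix,
        "output": output,
    }

# Turn 1: "Hi there Claude" (say, 4 tokens) -> "Hi, Dui" (3 tokens).
# Nothing is cached yet, so the whole input is a cache-write.
t1 = turn_accounting(cached_prefix=0, full_input=4, output=3)

# Turn 2: full history (4 + 3) plus "How's it going?" (5 tokens)
# -> "Well, you?" (4 tokens). Only turn 1's input is in cache;
# the previous reply plus my new message get written.
t2 = turn_accounting(cached_prefix=4, full_input=4 + 3 + 5, output=4)

print(t1)  # {'cache_read': 0, 'cache_write': 4, 'output': 3}
print(t2)  # {'cache_read': 4, 'cache_write': 8, 'output': 4}
```

Notice how little of turn 2 is fresh work: most of the input is just the conversation so far, shuttled back in.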

OK, why does this matter? Well, because input tokens are the banana split! We want to get output, but ideally we want to get it with as little input as possible. Cache reads, in particular, are the emptiest of calories, because we’re just repeating ourselves.

If we count cache-reads as part of our token usage, I can ramp my 650k-token Lord of the Rings up to tens of billions by just sending it back and forth as cache reads, while in practice I’m literally wasting resources by sending the AI the exact same input over and over while it does no new work.

But guess where the bulk of our AI spend typically is? You guessed it: reads from cache. In fact, cache-reads often easily dominate our expenditures even though they cost 10% as much as input tokens and only 2% as much as output tokens!
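A back-of-the-envelope simulation shows why. In an agentic session, the whole (growing) context is re-read from cache on every step, while each step’s output is comparatively small. All the numbers below are assumptions (hypothetical prices keeping the 10%/2% ratios from the text, made-up context and step sizes), but the shape of the result is the point:

```python
# Why cache reads dominate agentic spend: the context is re-read in
# full at every step and grows with each step's output and tool results.
INPUT_PRICE = 3.00 / 1_000_000           # $ per fresh input token (assumed)
CACHE_READ_PRICE = INPUT_PRICE * 0.10    # 10% of input
OUTPUT_PRICE = 15.00 / 1_000_000         # output ~5x input (assumed)

context = 50_000       # system prompt, instructions, initial file reads
read_tokens = 0
output_tokens = 0

for step in range(50):                # 50 agentic steps (edits, tool calls...)
    read_tokens += context            # whole context re-read from cache
    output_tokens += 2_000            # the model's actual work this step
    context += 2_000 + 8_000          # its output plus a tool result join the context

read_cost = read_tokens * CACHE_READ_PRICE
output_cost = output_tokens * OUTPUT_PRICE
print(f"cache-read tokens: {read_tokens:,}")  # millions, vs 100k of output
print(f"cache reads: ${read_cost:.2f} vs output: ${output_cost:.2f}")
```

Even at 2% of the output price per token, the sheer volume of re-read context makes cache reads the bigger line item.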

Cache reads are the cost of agentic AI: agents do things, then do more based on their previous output and our steering. All else is never equal, but if it were, you’d be trying to get as much output as possible with as few cache writes and reads as possible.

In short: As it exists today, the goal is to get the output you want with as little input as you can get away with.

But how does that translate to software engineering productivity?

How productive are you?

Let’s stop kidding ourselves — these two things are simultaneously true:

  1. Some software engineers are better, and add more value to the company, than others.
  2. Measuring a software engineer’s productivity is really hard, and nobody does this well today in the industry.

Can we start there? If you don’t agree with the above, then nothing I tell you next will make sense.

The current gold standard for software engineering productivity is DORA metrics, and we’ve been making some progress, with Nicole Forsgren et al., toward more measures of software engineering productivity.

One thing all these metrics have in common: none of them are individual metrics! Not one productivity metric currently well accepted by the industry is individual. They’re all either team or departmental metrics.

Now, does that mean literally every metric you collect on an individual engineer is basically noise and informs you of nothing? No, it doesn’t.

Software engineering metrics are not very accurate measures of productivity, but they’re not random.

In particular, one big problem is that the metrics themselves are typically very precise, but whenever you have a precise metric that’s an inaccurate proxy for a complex measure, you can get into trouble by substituting one for the other. Like, say, I don’t know, tokens for performance on the job.

Software engineering productivity, like ‘health’ and ‘fitness,’ is a holistic measure, and you can get a good picture of this measure if you embrace its inaccuracy and consolidate all the metrics that may give signals about it.

How many PRs were closed? How long were they opened? How many tickets closed? How many story points? Did that change? Increase? Decrease? How many bugs? Incidents? Did that increase? Support tickets? Output tokens? Increased? Decreased?

Trying to measure something as holistic as productivity with tokens is like trying to measure my health based on the size of my side delts. Yes, there’s a good chance I’m overall strong if my side delts are strong — I’m probably training everything else if I’m making time for lateral raises. And there’s some chance I’m healthy if I’m strong, because strong people are more often healthy than weak ones. But my side delts’ size alone says very little about my overall health, and there’s a huge margin for inaccuracy. Literally the only thing it says well is how big my side delts are!

In short, I think there’s nothing wrong with measuring token usage, particularly output tokens, to get a signal on software engineering (and overall knowledge work) productivity.

But getting a single limited metric and assuming it’s a valid proxy for a complex holistic measure says a little about how bad the metric is and a lot about how bad the leadership is.

Measure your output tokens!

Yes, you should absolutely measure token usage, broken down by type, at an individual, team, department, and company level, just like you should measure a lot of other metrics in order to have proper analytics about your operations.

I also believe strongly that token usage, particularly output tokens, is an important signal of productivity for any knowledge worker, as long as it’s not poorly used in isolation as a proxy for productivity.

Now, go find out how many output tokens you’ve used.