Not Sigmoids, Not Exponentials
Distributions instead of trajectories (Part 2: Bak and the sandpile)
In Part 1, I walked through Didier Sornette’s Log-Periodic Power Law Singularity model as an example of what it looks like when a forecasting framework is built with its own falsifier baked in.
The argument was that the AI capability discourse’s dominant frameworks commit to functional forms (exponentials, sigmoids, custom curves) without committing to the observations that would force them to abandon those forms. I used LPPLS to show that this discipline exists, and that it is tractable to borrow it from the complex-systems literature, where some models have 30+ years of refinement behind them.
While Part 1 argued that the curves we’re fitting need better falsifiers, here I will ask whether curves are the right object to be fitting in the first place.
On the table is Per Bak’s self-organized criticality (SOC). It comes from a different intellectual lineage than Sornette’s (statistical mechanics rather than econophysics) and it makes a structurally different kind of claim. Sornette’s frame asks “is this particular trajectory accelerating toward a critical point?” Bak’s frame asks “is the trajectory even the right thing to study?” If SOC applies to a system, then trying to predict the system’s path forward is the wrong research program. An SOC analysis characterizes the distribution of jumps along the trajectory rather than asserting that the trajectory itself is predictable: we can predict the statistical distribution of events across many realizations, but not the size or timing of any individual event.
The pile
Put a flat table in front of you. Start dropping grains of sand on it, one at a time, always in the same spot.
For a while, nothing interesting happens. The grains pile up, the pile grows taller, and the sides slope outward. Most grains, when they land, just sit where they fall, or roll a short distance and stop. You could watch this for a long time and it would be boring.
But the slope keeps getting steeper. There’s a point where the slope is steep enough that a new grain landing on top can cause the grains beneath it to start sliding. The first time this happens, maybe three grains slide. Then it’s quiet again. You drop more grains. Eventually another slide. Maybe this one is bigger: twenty grains. Quiet again. You keep going. Most grains do nothing. Occasionally one triggers a slide. Sometimes the slide is small; sometimes it cascades and a significant section of the pile reorganizes itself.
If you keep this up long enough, the pile settles into a state where the slope hovers around a specific value (the angle of repose) and stays there. Add a grain, sometimes a slide happens, sometimes it doesn’t, and the slope keeps coming back to roughly the same place. The pile has organized itself into this state: nobody set the slope and nobody tuned a parameter. The dynamics of the system (grains accumulating, occasional slides relieving the accumulated stress) drove the pile to this configuration on their own. This is the “self-organized” part of self-organized criticality.
Now record the size of every avalanche that occurs over a long stretch of time. Count how many grains moved in each one. Plot a histogram of those sizes.
What you find is the part that mattered to Bak. The histogram does not look like a bell curve. There is no typical avalanche size. There are lots of small slides, fewer medium ones, a handful of large ones, and occasionally a very large one… and they all sit on the same statistical curve. If you zoom in on just the small avalanches and look at their distribution, it has the same shape as if you zoom out to include the large ones. The distribution is scale-invariant: there is no characteristic event size, no built-in sense of “normal.”
This is a power-law distribution, and it is the signature of self-organized criticality. The mechanisms (grains accumulating, threshold dynamics, local interactions that can propagate) produce this distribution without anyone designing it to. And the same mechanism that produces a three-grain slide produces a thirty-thousand-grain slide. Fundamentally they are the same kinds of events, which arise from the same dynamic underpinnings.
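The thought experiment above can be run as a small simulation. What follows is a minimal sketch of the Bak-Tang-Wiesenfeld sandpile, not a reproduction of the 1987 paper's setup. The grid size, grain count, function names, and the choice to drop grains on random cells (rather than always in the same spot, as in the story above) are all assumptions made for illustration. Avalanche size is counted as the number of topplings a single grain triggers.

```python
import random
from collections import Counter


def run_sandpile(size=20, grains=10000, seed=0):
    """Simulate a Bak-Tang-Wiesenfeld sandpile; return one avalanche size per grain.

    Avalanche size is the number of topplings the grain triggers
    (0 means the grain just sat where it landed).
    """
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    sizes = []
    for _ in range(grains):
        # Assumption: grains land on random cells, a common variant of the model.
        r, c = rng.randrange(size), rng.randrange(size)
        grid[r][c] += 1
        topplings = 0
        unstable = [(r, c)]
        while unstable:
            i, j = unstable.pop()
            if grid[i][j] < 4:  # below threshold: nothing to do
                continue
            grid[i][j] -= 4  # topple: shed one grain to each neighbour
            topplings += 1
            unstable.append((i, j))  # re-check in case it is still over threshold
            for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if 0 <= ni < size and 0 <= nj < size:
                    grid[ni][nj] += 1
                    unstable.append((ni, nj))
                # grains pushed past the edge fall off the table
        sizes.append(topplings)
    return sizes


def log_binned_counts(sizes):
    """Count avalanches in bins [1,2), [2,4), [4,8), ... keyed by lower edge."""
    counts = Counter()
    for s in sizes:
        if s > 0:
            counts[1 << (s.bit_length() - 1)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sizes = run_sandpile()
    for lower, count in log_binned_counts(sizes).items():
        print(f"{lower:>5} to {2 * lower - 1:>5} topplings: {count}")
```

The bins double in width so that small and large avalanches can be read on the same table. On a grid this small the largest avalanches are cut off by the edges of the table, so the printout is only a rough illustration of the histogram described above.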
Contrast this with what you’d see if the pile had a characteristic avalanche size. Imagine the same setup, but the sand had some property (sticky grains, say, or a maximum slope that physically cuts off propagation) that bounded how large any single slide could get. Then the histogram would have a peak. Most slides would be near a typical size, with a few smaller ones and a few larger ones tapering off. A bell curve, roughly. There would be a “normal” avalanche, and very large avalanches would be essentially impossible because the mechanism that produces them doesn’t exist in this version of the sand.
These two shapes — the power law and the bell curve — are fingerprints of two fundamentally different kinds of generating processes. A bell curve says: there is a typical event, and the system has a built-in scale, so large deviations are rare in a specific way. A power law says: there is no typical event, and the system has no built-in scale, so large deviations are produced by the same mechanism as small ones and are inevitable on long time horizons even if you can’t say when.
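The two fingerprints can also be written compactly. The notation below is an assumption introduced here, not taken from Bak: s is event size, α is the power-law exponent, λ is a rescaling factor, C is a normalizing constant, and μ and σ are the center and spread of the bell curve.

```latex
% Power law: rescaling the event size by \lambda changes the density by a
% constant factor that does not depend on s.
p(s) = C\, s^{-\alpha}
\quad\Longrightarrow\quad
\frac{p(\lambda s)}{p(s)} = \lambda^{-\alpha}

% Bell curve: the same ratio depends on s, so the shape changes with scale.
p(s) \propto \exp\!\left(-\frac{(s-\mu)^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
\frac{p(\lambda s)}{p(s)} = \exp\!\left(-\frac{(\lambda s-\mu)^2-(s-\mu)^2}{2\sigma^2}\right)
```

In the first case, doubling the event size always costs the same factor in probability, whether you start from a small event or a large one. In the second, the cost depends on where you start relative to μ, which is what it means for the system to have a built-in scale.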
When forecasting a system, you can ask what curve fits the trajectory, and that can work (this is what METR does). I think the more informative question is often what distribution fits the jumps. If the jumps are bell-curve-distributed, the trajectory has predictable structure and curve-fitting is a sensible research program. If the jumps are power-law distributed, the trajectory is dominated by rare large events that are individually unpredictable, and curve-fitting is studying the wrong object. You’d be looking at the height of the pile when the interesting structure is in the avalanche distribution.
Bak spent the last decade of his career arguing that this is not an exotic edge case. Earthquakes follow power-law size distributions (the Gutenberg-Richter law). So do forest fires, solar flares, extinction events in the fossil record, and neuronal avalanches in the cortex. His more controversial claim was that this list is no coincidence: a wide class of complex systems, when driven slowly through threshold dynamics with local interactions, will naturally end up in the critical state without anyone tuning them there. Whether the strong version of that claim is right is still debated. The narrower version is that SOC is a well-defined mathematical framework, that some real-world systems exhibit it, and that the diagnostic for membership in this class is the shape of the event-size distribution.
Given that SOC has been applied to such a variety of processes (forest fires, solar flares, and the others above), can we apply it to AI capability development? I don’t know, but we can specify what the mapping would require. Running the test would then take deep work to build the measurement infrastructure it needs.
For readers who want the math, Bak’s How Nature Works (1996) is the canonical reference, and the original Bak-Tang-Wiesenfeld paper (1987) has the formal definition of the sandpile cellular automaton.
Setting up the test
SOC makes a specific claim about how the events a system produces are distributed, and testing whether that claim applies requires data with particular properties.
Four requirements follow from the framework itself.
First, enough events. Power-law distributions are notoriously difficult to distinguish from other heavy-tailed distributions (log-normals, stretched exponentials, mixtures, etc.) in finite samples. The canonical methodological reference here is Clauset, Shalizi, and Newman’s 2009 paper on fitting power laws to empirical data, and their guidance is that you generally need at least a few hundred events to make the distinction with any confidence. Fewer than that and the test isn’t conclusive in either direction.
Second, a wide range of event sizes. The signature of SOC is scale invariance: the distribution looks the same whether you zoom in on small events or out to large ones. You cannot test scale invariance on data confined to a single order of magnitude. The events need to span at least two or three orders of magnitude in size, ideally more.
Third, homogeneous driving. The “grains” feeding the system have to be roughly interchangeable. If you fold mechanism heterogeneity into the driving input (different kinds of events being treated as the same kind), the resulting distribution is contaminated by that heterogeneity rather than revealing the underlying dynamics.
Fourth, a stable size metric. The “size” of an event has to mean the same thing across the range you’re measuring. If your metric saturates, shifts meaning, or is gameable at different scales, the shape of the distribution is an artifact of the metric rather than a property of the system.
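The first two requirements can be made concrete with a short sketch. It uses the continuous maximum-likelihood estimator for the exponent from Clauset, Shalizi, and Newman, with its approximate standard error, plus a helper that reports how many orders of magnitude a sample spans. The function names, the synthetic data, the true exponent of 2.5, and the assumption that the lower cutoff x_min is known in advance are all illustrative choices. The full procedure in that paper also estimates x_min and compares the power law against alternative distributions, which this sketch does not attempt.

```python
import math
import random


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, x_min):
    """Maximum-likelihood exponent and its approximate standard error."""
    tail = [x for x in xs if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    std_err = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, std_err, n


def decades_spanned(xs):
    """Orders of magnitude between the smallest and largest event."""
    return math.log10(max(xs) / min(xs))


if __name__ == "__main__":
    for n in (12, 30, 500):
        xs = sample_power_law(n, alpha=2.5, seed=n)
        alpha_hat, std_err, _ = fit_alpha(xs, x_min=1.0)
        print(f"n={n:>3}: alpha = {alpha_hat:.2f} +/- {std_err:.2f}, "
              f"span = {decades_spanned(xs):.1f} decades")
```

Even when the data really are drawn from a power law, the uncertainty on the exponent shrinks only as one over the square root of the number of events. The estimate from a dozen events is far looser than the estimate from several hundred, and that is before asking whether a log-normal would fit equally well.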
Most candidate measurements for AI capability development satisfy some of these but not all.
Candidate 1: SOTA-breaking results across benchmarks
The most data-rich way to attempt the SOC test is to look at the historical record of state-of-the-art results across machine learning benchmarks. Papers With Code has tracked SOTA progression on hundreds of benchmarks over the last decade, and academic aggregators like HELM, BIG-bench, and the various leaderboard ecosystems capture thousands of (model, benchmark, score) triples. The event is “a paper or model release broke the SOTA on benchmark Z at time T,” and the size of the event is the magnitude of the SOTA improvement.
This operationalization satisfies the first requirement easily. We have many thousands of SOTA-break events on record. A serious analyst could compile a dataset spanning the last decade across hundreds of benchmarks and have plenty of statistical power for a power-law fit.
It also plausibly satisfies the second requirement. SOTA-break magnitudes range from fractional-percentage improvements at the top of saturated benchmarks to dramatic step-changes when a new architecture clears a previously stuck task. The events span at least a couple of orders of magnitude in raw size, possibly more depending on how you measure.
The third and fourth requirements aren’t as clean to map.
The “size” of a SOTA-break is structurally entangled with how the benchmark was constructed. Benchmarks are designed to discriminate among models within some target capability range. A new benchmark released when the field is at 30% on it has lots of room to grow; the early SOTA-breaks on it tend to be large in raw-percentage terms because there’s headroom. The same benchmark, five years later, sitting at 95%, produces SOTA-breaks measured in fractions of a percentage point because there’s nowhere left to go. The raw size of the event is dominated by where on the benchmark’s lifecycle the event occurred, not by how much capability the model actually gained.
This contaminates both the third and the fourth requirements. The “events” aren’t homogeneous because SOTA-breaks on a fresh benchmark and SOTA-breaks on a saturated one are produced by different competitive dynamics. And the size metric isn’t stable because the same underlying capability gain produces different SOTA-delta sizes depending on benchmark saturation.
You can attempt corrections. A logit transform of benchmark scores partially addresses the saturation issue by stretching the high end of the scale. Restricting to benchmarks released within a narrow time window gives you more homogeneous events but cuts the sample size. Normalizing by some measure of “remaining headroom” embeds an assumption about where the ceiling is. Each correction trades one problem for a different one, and each correction smuggles structural assumptions into the result.
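A worked example shows what the logit correction does and does not fix. The two SOTA-breaks below are hypothetical numbers chosen to match the lifecycle story above (a jump from 30% to 40% on a fresh benchmark, and a jump from 95% to 96% on a saturated one). The function names are likewise illustrative.

```python
import math


def logit(p):
    """Log-odds of an accuracy p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sota_break_size(old, new, scale="raw"):
    """Size of a SOTA-break in raw accuracy points or in logit units."""
    if scale == "raw":
        return new - old
    if scale == "logit":
        return logit(new) - logit(old)
    raise ValueError(f"unknown scale: {scale}")


if __name__ == "__main__":
    # Hypothetical events: (old score, new score)
    events = {"fresh benchmark": (0.30, 0.40), "saturated benchmark": (0.95, 0.96)}
    for name, (old, new) in events.items():
        raw = sota_break_size(old, new, "raw")
        stretched = sota_break_size(old, new, "logit")
        print(f"{name}: raw = {raw:.2f}, logit = {stretched:.2f}")
```

In raw terms the first event is ten times the size of the second. In logit units the gap narrows to roughly 0.44 versus 0.23, a bit under a factor of two. The transform changes the ranking of event sizes substantially, which is the point of the correction. It is also the problem, because the choice of transform is now shaping the distribution you are trying to measure.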
The deeper concern is that even with corrections, the SOTA-break dataset may be measuring which benchmarks happened to exist when, rather than the underlying dynamics of capability development. The visible avalanches are the ones we built thermometers to measure. The framework wants the events themselves, generated by the system; the data gives us the events filtered through whatever measurement apparatus the field happened to construct.
For SOTA, we have the events but are limited by not having a clean way to size them.
Candidate 2: Compute-adjusted capability gains
A second operationalization addresses some of the saturation problems from the first by using compute as the index variable rather than calendar time. Epoch AI has assembled a public dataset of training compute estimates across notable machine learning models going back to the 1950s, expressed in FLOPs. For each compute-doubling in the historical record, you can ask: what was the largest capability gain observed during that doubling period, measured against a stable benchmark suite?
This framing treats the system as driven by compute rather than by time. Compute is more theoretically motivated as a “grain” because the scaling-laws literature gives us reasons to expect capability to be a function of compute specifically. The events become “a compute-doubling occurred between time T and time T+dt, and during that interval the largest measurable capability gain on a held benchmark was X.” The size of the event is X.
On the first requirement, this is a step down from SOTA-breaks. Compute-doublings in the historical record number in the dozens, not the thousands. Each doubling is a single event, and we’ve had maybe twenty-five to thirty doublings in the modern deep-learning era, depending on where you start the clock (a count of 25 to 30 corresponds to starting it around 2010). That’s well below the sample-size threshold for confident power-law fitting.
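The arithmetic behind the doubling count is simple enough to sketch. The number of doublings between two training runs is the base-2 logarithm of the ratio of their compute. The FLOP values in the example are round, hypothetical endpoints, not figures taken from Epoch AI's dataset, and the function names are illustrative.

```python
import math


def doublings(start_flop, end_flop):
    """Number of compute-doublings between two training-compute levels."""
    return math.log2(end_flop / start_flop)


def growth_factor(n_doublings):
    """Total multiplicative growth implied by a number of doublings."""
    return 2.0 ** n_doublings


if __name__ == "__main__":
    for n in (25, 30):
        print(f"{n} doublings = a factor of {growth_factor(n):.2e}")
    # Hypothetical endpoints: 1e17 FLOP to 1e25 FLOP
    print(f"1e17 -> 1e25 FLOP: {doublings(1e17, 1e25):.1f} doublings")
```

Twenty-five to thirty doublings is a growth factor of roughly thirty million to one billion in compute, yet it yields only twenty-five to thirty events. An enormous range in the index variable does not translate into a large sample.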
The scale-span requirement is satisfied well at first glance. Compute itself spans an enormous range: from mega-FLOPs in early neural network research to nearly twenty orders of magnitude more in current frontier training runs. If you’re indexing by compute, you have scale span in spades.
But. The compute range from 1950s neural networks to current frontier models is not a homogeneous range. Before the 2012 ImageNet breakthrough, deep learning was a fringe research program using compute for architectures and tasks that are barely recognizable as ancestors of current systems. The 2012-2017 period was dominated by convolutional architectures. The post-2017 period is dominated by transformers. Each regime has its own capability-per-FLOP relationship, and the relationships aren’t continuous across regime boundaries. Treating a 1980s perceptron’s compute as the same kind of thing as a 2024 frontier pretraining run violates the third requirement directly. The grains are not interchangeable.
If you restrict to the transformer era to fix the homogeneity problem, you collapse the time horizon. Maybe eight years of usable data, perhaps fifteen to thirty distinct frontier-scale training runs. Now you have a homogeneous regime but you’ve fallen below the sample-size floor and lost most of the scale span.
That is Candidate 2’s tension. The compute framing fixes the saturation problem from Candidate 1 by giving us a theoretically motivated grain, but the theoretical motivation only holds within a single architectural regime, and within any single regime the historical record is too short to fit a power law. Across regimes, the data has range but no homogeneity. Within regimes, the data has homogeneity but no range.
A determined practitioner could still attempt the fit, restricting to a single regime and accepting that the result is provisional and underpowered. The honest result would not be “SOC applies or doesn’t” but “the data is consistent with too many distributions to discriminate between them.”
For compute-adjusted gains, we have a clean grain but not enough history to use it.
Candidate 3: User-facing capability events
A third operationalization moves closest to what the AI discourse actually argues about. The events are not benchmark deltas or compute-doublings but moments when a new user-facing capability became widely available. Clear examples are ChatGPT’s release in late 2022, GPT-4 in early 2023, the image-generation threshold crossing into photorealism, code completion becoming production-grade, and agents completing multi-step tasks autonomously. The size of the event is something like its economic or practical significance: how much the world reorganized around the new capability, how many workflows changed, and how much money moved.
This is the operationalization that maps most cleanly onto the dynamics the SOC framework is trying to capture: long periods where the field accumulates capability invisibly to most users, punctuated by sudden reorganizations when a capability crosses some threshold and becomes deployable. The grains are research outputs, infrastructure buildouts, alignment work, and product development; they accumulate slowly inside labs; occasionally one tips the system into a regime where a new capability ships and the world adjusts. The cadence of these events and their relative magnitudes is what the AI capability discourse implicitly cares about when it argues about “transformative” AI or “what changes next.”
But. We will struggle to fulfill the four requirements.
On sample size, we have maybe a dozen events that anyone would agree count as discrete user-facing capability moments in the modern era. A dozen events is not a power-law fit.
On scale span, the events do appear to span a wide range in their downstream impact (a niche capability becoming production-grade and ChatGPT’s release are clearly different magnitudes), but we have no agreed-upon way to measure that range. There is no standard unit for “amount the world reorganized.” Any sizing we produce is essentially editorial.
The grain requirement is undefined in a way that the previous candidates’ grain problems were not. With SOTA-breaks, the grain was a paper or model release, and the problem was that the grains were treated as homogeneous when they weren’t. With compute-adjusted gains, the grain was a FLOP, and the problem was regime discontinuity. Here, what is the grain? Research outputs? Engineering hours? Capital allocation? Each of these is a candidate, and the candidates aren’t even commensurable.
The stable-metric requirement is similarly undefined. “Economic significance” of a capability event is measurable in principle: you could try to estimate market cap shifts, productivity gains, or revenue attribution… but each measurement methodology produces a different ranking of events, and there is no canonical methodology. Whatever number you produce for the “size” of ChatGPT’s release is downstream of methodological choices that other reasonable analysts would make differently.
Everything that makes this operationalization interesting also makes it untestable. It captures the thing we want to study but it gives us no rigorous way to study it.
This isn’t a candidate measurement so much as a description of what we wish we could measure. If a research program existed that produced a defensible sizing methodology for user-facing capability events across many decades, with enough events to fit a distribution, the SOC test could be run. No such research program exists.
For user-facing events, we have the right object but no way to measure it.
The Fork in the (Model) Road
Now, let’s step back from the three operationalizations and ask what we’d actually learn if any of them could be cleanly run.
Bak’s framework, applied to AI capability development, produces a three-way diagnostic. Each branch implies a different research program and a different forecasting posture.
If the distribution of capability jumps is Gaussian-like, with a characteristic event size and bell-curve falloff, then AI capability development has a built-in scale. The system is producing predictable-sized improvements through mechanisms that bound how large any single event can get. METR’s exponential framework is the natural fit for this world: roughly steady-sized improvements producing roughly steady doubling times. Part 1’s question (which curve fits the trajectory) is the right question. The curve-shape discourse is well-posed and the work to do is fitting better curves with better falsifiers.
If the distribution is power-law with an exponent in the SOC range (roughly one to three, depending on which variant of the framework you’re applying), the underlying dynamics produce no characteristic event size. The same mechanism produces small jumps and large ones, and large jumps are inevitable on long enough horizons even though they’re individually unpredictable. The trajectory is dominated by rare large events that curve-fitting cannot capture. Part 1’s question is malformed and the curve-shape discourse is studying the wrong object. The right work is characterizing the distribution and its exponent, and forecasting the probability of large jumps rather than the timing of any specific jump.
If the distribution is heavy-tailed but not power-law, such as log-normal, stretched exponential, mixture distributions, or something with a characteristic scale at the top, then some other mechanism is operating. Multiplicative noise produces log-normals; preferential attachment produces specific power-law shapes; mixtures produce composites. Each of these implies a different generative process than SOC. In this world, neither the smooth-curve frameworks nor the SOC framework cleanly applies, and the right research program is identifying which mechanism is actually generating the data.
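The claim that multiplicative noise produces log-normals can be checked with a small simulation. In this sketch each outcome is the product of many independent random factors, so its logarithm is a sum of independent terms and should look roughly bell-shaped. The factor range (0.5 to 1.5), the number of factors, the sample size, and the function names are arbitrary illustrative assumptions, not a model of any real capability process.

```python
import math
import random
import statistics


def multiplicative_outcomes(n_samples=5000, n_factors=50, seed=0):
    """Each outcome is a product of n_factors independent random multipliers."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_samples):
        value = 1.0
        for _ in range(n_factors):
            value *= rng.uniform(0.5, 1.5)
        outcomes.append(value)
    return outcomes


def skewness(xs):
    """Sample skewness: near 0 for a symmetric, bell-shaped sample."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


if __name__ == "__main__":
    outcomes = multiplicative_outcomes()
    logs = [math.log(x) for x in outcomes]
    print(f"skewness of raw outcomes: {skewness(outcomes):.2f}")
    print(f"skewness of log outcomes: {skewness(logs):.2f}")
```

The raw outcomes are strongly right-skewed and heavy-tailed, while their logarithms are close to symmetric. That is the log-normal signature, and it arises here with no thresholds, no local interactions, and no criticality, which is why a heavy tail on its own does not establish SOC.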
These three branches are not interchangeable. They correspond to genuinely different worlds and would license genuinely different forecasts. The current AI discourse argues about which curve to draw without checking which of these worlds we’re in.
The diagnostic test that would tell us has not been run, because we don’t have a measurement we can run it on.
The missing discipline
Each of the three operationalizations failed differently, but the failures share a structure. SOTA-breaks gave us events without a stable way to size them. Compute-adjusted gains gave us a clean grain without enough history to use it. User-facing capability events gave us the right object without any rigorous way to measure it. The specific obstacles differ but the kind of obstacle is the same: we do not have the measurement discipline that would let us run the diagnostic.
We need to separate the model from the data. The framework is well-defined and the test is unambiguous. Given a clean dataset of capability events with stable sizing, homogeneous driving, and enough range, the SOC diagnostic could be run tomorrow. The key constraint with SOC is dataset construction (and creating the infrastructure that could produce the dataset), but ignoring SOC because of that is a mistake.
This constraint is shared by curve-fitting frameworks. METR’s exponential fit, Epoch’s compute-scaling laws, and the AI Futures Project’s scenario modeling all proceed without the measurement discipline that would let us check whether their underlying assumptions are right. They are fitting curves to data without a settled answer to the question of whether the data even admits curve-fitting as a meaningful operation. Bak’s framework makes the absence visible because it requires a different measurement to even start; the curve-fitting frameworks make the absence invisible because they assume the measurement is fine and proceed.
The point
Bak’s framework questions the primitive that the curve-fitting frameworks share. It applies to systems no less complex than AI and has more than 30 years of prior work behind it. It produces a specific diagnostic test with a specific signature, and three branches with three different forecasting implications.
It deserves to be considered alongside METR’s time-horizon work, Epoch AI’s compute-scaling laws, the AI Futures Project’s scenario modeling, and Sornette’s LPPLS. Whether it will turn out to be the right framework for AI capability development is an open question. Whether the data even admits an answer to that question is a prior open question, and recognizing that is itself the discovery.
The curve-shape discourse has been proceeding as if its primitive is the obviously correct one to argue about. It isn’t. There are alternatives, the alternatives have falsifiers, and the alternatives change what we’d ask in the first place.