The counterfactual grid: trying to unify valuation, scaling laws, selection, poisoning, and leverage.
This page doubles as the permanent link I keep referencing whenever the conversation turns to “data counterfactuals”. The interactive grid embedded in the webpage lets you poke at the math, and this memo captures why this frame unifies so many strands of data work.
In short, data valuation, data scaling, data selection, data poisoning, nascent algorithmic collective action (ACA), and even some concepts in the privacy literature (like differential privacy) and in data-centric ML (like data augmentation) all touch on a shared idea: exploring the counterfactuals implied by data. Many methods seek to answer "what if the data were different?", "what if we had picked these data instead of those?", or "what if the world produced different data altogether?"
Data valuation techniques often tell us how a relevant metric (e.g. test loss) would change if a data point, or a group of data points, were missing, added, upweighted, or downweighted. Data scaling tells us what would happen if our training data size were 10x-ed. Data selection tries to tell us which data points we should pick from a given set. Data poisoning tells us what will happen when data is manipulated. ACA focuses on coordinated campaigns by people that cause data to change. And some privacy concepts like differential privacy and data-centric ML concepts like augmentation involve intentionally modifying data to promote some outcome (more privacy, better performance).
A key distinction runs through this space. Some techniques only explore counterfactual choices over the data that already exist—subsets, reweightings, synthetic perturbations. Others try to act on, or incentivize, people so that the world itself changes and produces new examples. The latter camp manufactures real counterfactuals, and that difference matters when we reason about agency, incentives, and responsibility.
The idea of a "data counterfactual" grid came up in a CMPT 419 course discussion (student questions remain invaluable for advancing these discussions).
Here's the idea: Imagine enumerating every possible training set as the rows of a grid and every possible evaluation set as the columns. To make this tractable for a toy visualization, as we'll see below, perhaps just start by imagining we have four observations or four big groups of observations: A, B, C, and D. Each cell stores the performance you would observe if you trained on that row and tested on that column, with the evaluation metric providing the third dimension. That giant tensor is the full landscape of data counterfactuals.
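As a minimal sketch of this construction, here is the toy four-group grid in code. The `performance` function is a hypothetical stand-in (fraction of test groups covered by the training set); in practice each cell would come from actually training on the row and evaluating on the column.

```python
from itertools import chain, combinations

# Toy sketch: four observation groups stand in for real training data.
groups = ["A", "B", "C", "D"]

def all_subsets(items):
    """Every subset of items, including the empty set."""
    return list(chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

# Rows: every possible training set; columns: every possible evaluation set.
rows = all_subsets(groups)  # 2^4 = 16 possible training sets
cols = all_subsets(groups)

def performance(train, test):
    """Hypothetical metric: fraction of test groups covered by the training
    set. In practice this is 'train on this row, evaluate on this column'."""
    if not test:
        return 0.0
    return sum(g in train for g in test) / len(test)

# The full grid: one cell per (training set, evaluation set) pair.
grid = {(tr, te): performance(tr, te) for tr in rows for te in cols}

print(len(grid))  # 256 cells: 16 rows x 16 columns
```

Even at four groups the grid has 256 cells; the exponential growth in rows is exactly why most methods only ever visit a small slice of it.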
Once you see the space that way, a lot of familiar ideas fall into place:
- Observation- and group-level values, Shapley values, Beta Shapley, and related notions are all aggregations over carefully chosen slices of that grid.
- We can understand what any valuation method is “really doing” by tracing how it walks the grid, which makes it easier to relate Beta Shapley, vanilla Shapley, leave-one-out, etc.
- Data scaling laws become simple regressions on the average scores that lie along increasing row sizes.
- Data selection and data leverage interventions—data strikes, boycotts, targeted contributions—are paths that move us to different rows and therefore different outcomes.
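To make the "walks the grid" point concrete, here is a minimal sketch, using the same hypothetical coverage metric and a fixed evaluation column, of leave-one-out and Shapley values as aggregations over rows of the toy grid:

```python
from itertools import chain, combinations
from math import comb

groups = ["A", "B", "C", "D"]

def subsets(items):
    return list(chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

def perf(train, test=("A", "B", "C", "D")):
    # Hypothetical stand-in for "train on this row, test on a fixed column".
    return sum(g in train for g in test) / len(test)

def leave_one_out(point):
    # A single grid comparison: the full row vs. the row missing `point`.
    full = tuple(groups)
    without = tuple(g for g in groups if g != point)
    return perf(full) - perf(without)

def shapley(point):
    # Averages `point`'s marginal contribution over every row that excludes
    # it, weighted by the standard Shapley coefficient: a different, much
    # larger walk over the same grid.
    others = [g for g in groups if g != point]
    n = len(groups)
    value = 0.0
    for s in subsets(others):
        weight = 1 / (n * comb(n - 1, len(s)))
        value += weight * (perf(s + (point,)) - perf(s))
    return value

print(leave_one_out("A"))  # 0.25
print(shapley("A"))        # also 0.25: this toy metric is additive
```

The two values coincide here only because the toy metric is additive; with interactions between data points, the different grid walks give different answers, which is the whole point of comparing valuation methods this way.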
Data strikes lower performance by nudging AI operators toward less favorable rows. Data poisoning does the same, but to tell the full story you have to augment the row universe with every variant of every data point—an explosion of possible worlds that the grid keeps conceptually tidy. This same view also clarifies how strikes or poisoning impact evaluation: if we only ever test on a thin slice of the columns, we risk missing exactly the failures the actions were designed to induce.
The grid also gives us a cost model. “Unveiling” the grid—actually measuring enough cells to run a Shapley calculation or a scaling-law fit—is often the dominant expense in data valuation, so we can reason about when the marginal benefit of better accounting beats the marginal cost of more evaluations. Likewise, the metaphor helps us price the work required to “generate” new parts of the grid. Think of it as a board game where you drop fresh tiles to build the world: each new tile represents data labor, annotation, or incentive design, and the grid tells us whether that tile changes downstream outcomes enough to justify the effort.
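One way to see the marginal-cost trade-off is that exact Shapley needs every row of a column unveiled, while a Monte Carlo estimator (a standard permutation-sampling scheme, sketched here with the same hypothetical metric) pays a fixed budget of cell evaluations regardless of how many rows exist:

```python
import random

groups = ["A", "B", "C", "D"]
test_col = ("A", "B", "C", "D")

def perf(train):
    # Hypothetical stand-in metric: fraction of the column covered.
    return sum(g in train for g in test_col) / len(test_col)

def mc_shapley(point, budget, seed=0):
    # Each sampled permutation costs two cell evaluations, vs. unveiling
    # all 2^n rows for the exact value.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(budget):
        order = groups[:]
        rng.shuffle(order)
        idx = order.index(point)
        before = tuple(order[:idx])
        total += perf(before + (point,)) - perf(before)
    return total / budget

# 50 permutations = 100 cell evaluations; the budget stays fixed even as
# the number of rows explodes exponentially with more data points.
print(mc_shapley("A", budget=50))  # 0.25: exact here, since the metric is additive
```

Whether those 100 evaluations are worth it is precisely the unveiling-cost question: the marginal benefit of a better estimate against the marginal cost of measuring more cells.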
(The WIP "game view" available via the "Open Tactical Board" link tries to make this even more concrete).
That is why I find the grid metaphor so useful. It ties unknown unknowns in ML evaluation, data leverage actions, selection heuristics, and privacy interventions back to the same concrete object. Exploring data counterfactuals is the umbrella activity; the particular community practice is just the lens we use to look at a specific subset of the grid.