Exploring Data Counterfactuals

The counterfactual grid: trying to unify valuation, scaling laws, selection, poisoning, and leverage.

This page doubles as the permanent link I keep referencing whenever the conversation turns to “data counterfactuals”. The interactive grid embedded in the webpage lets you poke at the math, and this memo captures why this frame unifies so many strands of data work.

In short, data valuation, data scaling, data selection, data poisoning, nascent algorithmic collective action (ACA), and even some concepts in the privacy literature (like differential privacy) and in data-centric ML (like data augmentation) all touch on a shared idea: exploring the counterfactuals implied by data. Many methods seek to answer "what if the data were different?", "what if we had picked these data instead of those?", or "what if the world produced different data altogether?"

Data valuation techniques often tell us how a relevant metric (e.g. test loss) would change if a data point, or a group of data points, were missing, added, upweighted, or downweighted. Data scaling tells us what would happen if our training data size were 10x-ed. Data selection tries to tell us which data points we should pick from a given set. Data poisoning tells us what will happen when data is manipulated. ACA focuses on campaigns in which people coordinate to change the data they produce. And some privacy concepts like differential privacy, and data-centric ML concepts like augmentation, involve intentionally modifying data to promote some outcome (more privacy, better performance).

A key distinction runs through this space. Some techniques only explore counterfactual choices over the data that already exist—subsets, reweightings, synthetic perturbations. Others try to act on, or incentivize, people so that the world itself changes and produces new examples. The latter camp manufactures real counterfactuals, and that difference matters when we reason about agency, incentives, and responsibility.

The idea of a "data counterfactual" grid came up in the CMPT 419 course (student questions remain invaluable for advancing these discussions).

Here's the idea: imagine enumerating every possible training set as the rows of a grid and every possible evaluation set as the columns. To make this tractable for a toy visualization, start by imagining we have just four observations, or four big groups of observations: A, B, C, and D. Each cell stores the performance you would observe if you trained on that row and tested on that column, with the evaluation metric providing the third dimension. That giant tensor is the full landscape of data counterfactuals.
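As a sketch, the grid can be represented as a plain dictionary keyed by (training row, eval column). The `score` function below is a hypothetical stand-in for actually training on a row and testing on a column:

```python
from itertools import combinations

# Toy universe of four observations (or four big groups), as in the memo.
POINTS = ("A", "B", "C", "D")

def score(train_subset, eval_point):
    """Hypothetical stand-in for 'train on this row, test on this column':
    here, a point is predicted well iff it was in the training row."""
    return 1.0 if eval_point in train_subset else 0.0

def build_grid(points):
    """Enumerate every non-empty training subset (rows) x eval point (columns)."""
    grid = {}
    for k in range(1, len(points) + 1):
        for row in combinations(points, k):
            for col in points:
                grid[(row, col)] = score(row, col)
    return grid

grid = build_grid(POINTS)
print(len(grid))  # 15 non-empty rows x 4 columns = 60 cells
```

With real models, each cell is one training-plus-evaluation run, which is exactly why "unveiling" the grid is expensive.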

Once you see the space that way a lot of familiar ideas fall into place:

  • Observation- and group-level values, Shapley values, Beta Shapley, and related notions are all aggregations over carefully chosen slices of that grid.
  • We can understand what any valuation method is “really doing” by tracing how it walks the grid, which makes it easier to relate Beta Shapley, vanilla Shapley, leave-one-out, etc.
  • Data scaling laws become simple regressions on the average scores that lie along increasing row sizes.
  • Data selection and data leverage interventions—data strikes, boycotts, targeted contributions—are paths that move us to different rows and therefore different outcomes.
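To make the "walking the grid" point concrete, here is a minimal sketch (with an assumed additive `metric` in place of real training runs) showing leave-one-out and Shapley as two different aggregations over the same cells:

```python
from itertools import combinations
from statistics import mean

POINTS = ("A", "B", "C", "D")

def metric(train_row):
    """Hypothetical cell value: fraction of the universe covered by the row."""
    return len(train_row) / len(POINTS)

def leave_one_out(point):
    """Compare the full row against the row missing one point: two cells."""
    full = frozenset(POINTS)
    return metric(full) - metric(full - {point})

def shapley(point):
    """Average the marginal gain of `point` over every subset of the rest:
    per-size averages first, then across sizes (classic Shapley weighting)."""
    rest = [p for p in POINTS if p != point]
    gains = []
    for k in range(len(rest) + 1):
        size_gains = [metric(frozenset(S) | {point}) - metric(frozenset(S))
                      for S in combinations(rest, k)]
        gains.append(mean(size_gains))
    return mean(gains)

print(leave_one_out("A"), shapley("A"))  # both 0.25 for this additive metric
```

For an additive metric the two walks agree; for metrics with interactions they diverge, and which slice of the grid a method visits is precisely what distinguishes it.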

Data strikes lower performance by nudging AI operators toward less favorable rows. Data poisoning does the same, but to tell the full story you have to augment the row universe with every variant of every data point—an explosion of possible worlds that the grid keeps conceptually tidy. This same view also clarifies how strikes or poisoning impact evaluation: if we only ever test on a thin slice of the columns, we risk missing exactly the failures the actions were designed to induce.
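A rough sketch of the strike mechanic, under the assumption that a strike simply makes every row containing a striker's data unavailable, after which the operator trains on the best remaining row (the `metric` here is a toy diminishing-returns score, not a real model):

```python
from itertools import combinations

POINTS = ("A", "B", "C", "D")

def metric(row):
    """Hypothetical performance of training on `row`: diminishing returns."""
    return sum(1.0 / (i + 1) for i in range(len(row)))

def best_available_row(strikers):
    """The operator picks the best row among those free of strikers' data."""
    rows = [frozenset(S) for k in range(1, len(POINTS) + 1)
            for S in combinations(POINTS, k)
            if not (set(S) & strikers)]
    return max(rows, key=metric, default=frozenset())

before = metric(best_available_row(set()))      # full universe available
after = metric(best_available_row({"C", "D"}))  # C and D strike
print(before, after)  # the strike moves the operator to a worse row
```

Poisoning would be modeled the same way, except the struck rows stay available with corrupted cell values rather than disappearing.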

The grid also gives us a cost model. “Unveiling” the grid—actually measuring enough cells to run a Shapley calculation or a scaling-law fit—is often the dominant expense in data valuation, so we can reason about when the marginal benefit of better accounting beats the marginal cost of more evaluations. Likewise, the metaphor helps us price the work required to “generate” new parts of the grid. Think of it as a board game where you drop fresh tiles to build the world: each new tile represents data labor, annotation, or incentive design, and the grid tells us whether that tile changes downstream outcomes enough to justify the effort.

(The WIP "game view" available via the "Open Tactical Board" link tries to make this even more concrete).

That is why I find the grid metaphor so useful. It ties unknown unknowns in ML evaluation, data leverage actions, selection heuristics, and privacy interventions back to the same concrete object. Exploring data counterfactuals is the umbrella activity; each community's particular practice is just a lens onto a specific subset of the grid.

Prefer a more visual version? Launch the "game-style" explorer via the "Open tactical board" link.

Guided examples

Tap a preset to auto-configure the explorer and narrate a simple scenario.


Worked example (computed by the explorer)

Shapley value asks: "on average, how much does the focus point change the score when added to a partial training set?" With the focus point evaluated at column A, averaging the marginal change over all 8 subset pairs gives φ ≈ 0.4688:

  |S|   Avg marginal Δ   # pairs
   0        1.0000           1
   1        0.5000           3
   2        0.3333           3
   3        0.2500           1
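These numbers can be reproduced with a toy metric. The choice v(S) = H(|S|) (the |S|-th harmonic number) is my assumption, reverse-engineered so that the marginal gain of adding the focus point to a size-k subset is 1/(k+1), matching the table:

```python
from fractions import Fraction
from itertools import combinations

OTHERS = ("B", "C", "D")  # everything except the focus point A

def v(subset):
    # Assumed toy metric: v(S) = H(|S|), so the marginal gain of one more
    # point on a size-k subset is exactly 1/(k+1), as in the table.
    return sum(Fraction(1, i) for i in range(1, len(subset) + 1))

pairs = []  # (|S|, marginal gain of adding the focus point A to S)
for k in range(len(OTHERS) + 1):
    for S in combinations(OTHERS, k):
        pairs.append((k, v(S + ("A",)) - v(S)))

phi = sum(d for _, d in pairs) / len(pairs)  # uniform average over all 8 pairs
print(float(phi))  # 0.46875, matching the reported 0.4688
```

One caveat worth flagging: averaging uniformly over all 8 pairs (as here) is the Banzhaf-style mean; the textbook Shapley value first averages within each subset size and then across sizes, which would give 25/48 ≈ 0.5208 on these numbers.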

Edit data / world

Think of this as a filter matrix: operator view = filter ⊙ real-world grid.
This is a simple additive corruption; in practice you’d use learned adversarial vectors.
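A minimal sketch of the filter idea, assuming the grid is just a NumPy array and the filter acts elementwise:

```python
import numpy as np

# Real-world grid: 3 training rows x 2 eval columns (toy numbers).
real = np.array([[0.9, 0.6],
                 [0.7, 0.8],
                 [0.5, 0.4]])

# Filter matrix: 1 = cell observed faithfully, <1 = degraded by poison/noise,
# 0 = the cell is hidden from the operator entirely.
filt = np.array([[1.0, 1.0],
                 [0.5, 1.0],
                 [1.0, 0.0]])

operator_view = filt * real  # elementwise: operator view = filter ⊙ real grid
print(operator_view)
```

Multiplicative filtering is just one assumption; additive corruption (as the explorer uses) would be `real + noise` instead, but the operator-vs-real split is the same.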

Counterfactual Grid

The Operator view reflects poison/noise/world edits; the Real world view shows the untouched matrix. (A possible future 3D comparison: small-multiple stacks or height fields.)
[Grid visualization: both views list the 15 non-empty subsets of {A, B, C, D} as rows: A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD.]
Color shows the cell value; amber rings = cells used as inputs, cyan rings = the paired/selected cells. White outline marks your current cell.

Appendix / Extras (TODO: add links, etc.)

This section is still a placeholder—TODO: add outbound links, citations, and deeper dives.

Influence (TODO: links)

Influence asks: “if I nudge the weight of this one point, how does the score change?” On this grid we approximate that by the leave‑one‑out difference at the baseline row, so the ∆ value doubles as the finite‑difference influence estimate.
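A sketch of that finite-difference estimate, with a hypothetical `metric` standing in for the grid cell values (the +0.1 bump for A is an arbitrary assumption so the influence is visibly nonzero):

```python
POINTS = ("A", "B", "C", "D")

def metric(row):
    """Hypothetical score of training on `row` at a fixed eval column."""
    return len(row) / len(POINTS) + (0.1 if "A" in row else 0.0)

def influence_estimate(point, baseline_row):
    """Finite-difference influence: the leave-one-out delta at the baseline row."""
    without = tuple(p for p in baseline_row if p != point)
    return metric(baseline_row) - metric(without)

print(influence_estimate("A", POINTS))  # ~0.35: coverage 0.25 plus A's 0.1 bump
```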

Meta-gradient / acquisition (TODO: links)

Meta-gradient reasoning looks at how the score changes as you grow the training set, hinting at “what should we add next?” It’s the intuition behind the Scaling view above—fit a slope on k to decide where diminishing returns set in.
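A sketch of that slope fit, with assumed average scores per row size k (the logarithmic form is one common choice, not necessarily what the explorer uses):

```python
import numpy as np

# Assumed: average score across all rows of each size k, from a toy grid.
ks = np.array([1, 2, 3, 4])
avg_scores = np.array([0.50, 0.65, 0.73, 0.78])  # diminishing returns

# Fit score ~ a + b * log(k): the slope b prices the next doubling of data;
# a small b signals that diminishing returns have set in.
b, a = np.polyfit(np.log(ks), avg_scores, 1)
print(round(b, 3), round(a, 3))
```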

Poison / leverage (TODO: links)

Poison captures data corruption or strikes. In the Explorer we model it by degrading rows that contain the focus point, mirroring leverage actions without exploding the grid.
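A sketch of that modeling choice: rather than enumerating poisoned variants of every data point, just apply a degradation factor to any row containing the focus point (the factor 0.6 is an arbitrary assumption):

```python
from itertools import combinations

POINTS = ("A", "B", "C", "D")
FOCUS = "A"
POISON_FACTOR = 0.6  # assumed severity: affected rows lose 40% of their score

def clean_metric(row):
    """Hypothetical clean score for training on `row`."""
    return len(row) / len(POINTS)

def poisoned_metric(row):
    # Degrade every row containing the (poisoned) focus point, leaving the
    # rest of the grid untouched: no explosion of poisoned-variant rows.
    m = clean_metric(row)
    return m * POISON_FACTOR if FOCUS in row else m

rows = [S for k in range(1, 5) for S in combinations(POINTS, k)]
best_clean = max(rows, key=clean_metric)
best_poisoned = max(rows, key=poisoned_metric)
print(best_clean, best_poisoned)  # poisoning shifts the best row away from A
```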

TracIn / influence curves (TODO)

TracIn scores examples via gradient inner products along the training trajectory. The grid could approximate it by storing synthetic gradients per row/column and summing their directional dot products, highlighting which cells push an eval column up or down.

Selection / acquisition planning (TODO: links)

The planned selection view will scan all rows of size k and report the best subset for a chosen evaluation column—exactly the data-curation question (“which partial dataset should we ship?”). Until that lands, use the Scaling view plus the raw grid to reason about those k-slices.
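Until the selection view lands, the scan it describes is easy to sketch directly, with a hypothetical `cell` function in place of the real grid:

```python
from itertools import combinations

POINTS = ("A", "B", "C", "D")

def cell(train_row, eval_col):
    """Hypothetical grid cell: eval_col is predicted well only if a 'nearby'
    point is in the training row (adjacency in the A-D ordering)."""
    idx = {p: i for i, p in enumerate(POINTS)}
    return max((1.0 / (1 + abs(idx[p] - idx[eval_col])) for p in train_row),
               default=0.0)

def best_subset_of_size(k, eval_col):
    """Scan all rows of size k and report the best one for a chosen column."""
    return max(combinations(POINTS, k), key=lambda row: cell(row, eval_col))

print(best_subset_of_size(2, "A"))  # a row containing A scores best for column A
```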