The smallest useful move

The simplest data counterfactual is leave-one-out: compare a training world that includes one point with the nearby world in which that point is removed. From there you can ask the same question about groups, fixed-size subsets, synthetic replacements, corrupted examples, withheld data, or coordinated withdrawal.

In that sense, a data counterfactual is just a concrete what-if question about the data used to train a model.
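The leave-one-out comparison can be written in a few lines. This is a minimal sketch with a stand-in "model" (the mean of the training labels) and a stand-in performance score (negative absolute error on one evaluation label); real pipelines substitute real training and evaluation, but the counterfactual structure is the same.

```python
# Leave-one-out sketch. Stand-in model: mean of training labels.
# Stand-in score: negative absolute error on one evaluation label.

def train(labels):
    return sum(labels) / len(labels)      # "model" = mean label

def score(model, eval_label):
    return -abs(model - eval_label)       # higher is better

train_labels = [1.0, 2.0, 9.0]            # toy data; 9.0 is an outlier
eval_label = 2.0

full = score(train(train_labels), eval_label)

loo_effect = {}
for i in range(len(train_labels)):
    held_out = train_labels[:i] + train_labels[i + 1:]
    # Positive effect: the point was helping on this evaluation;
    # negative: it was hurting.
    loo_effect[i] = full - score(train(held_out), eval_label)

print(loo_effect)
```

Here the outlier (index 2) gets a negative effect: removing it improves the score on this evaluation point.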

Leave-one-out toy example

Train set   A      B      C      D
ABCD        0.92   0.88   0.85   0.82
ACD         0.78   0.55   0.82   0.80
Here the lower row is trained without point B. The largest drop lands on evaluation column B, which is exactly the intuition many attribution methods try to formalize: the points a prediction depends on are the ones whose removal hurts it most.
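The "where did the drop land" reading of the table can be computed directly. This snippet just encodes the two rows above as dictionaries and takes the per-column difference:

```python
# The two table rows, as mappings from evaluation point to performance.
with_b    = {"A": 0.92, "B": 0.88, "C": 0.85, "D": 0.82}  # trained on ABCD
without_b = {"A": 0.78, "B": 0.55, "C": 0.82, "D": 0.80}  # trained on ACD

# Per-column performance drop caused by removing B from training.
drop = {k: round(with_b[k] - without_b[k], 2) for k in with_b}
hardest_hit = max(drop, key=drop.get)

print(drop, hardest_hit)
```

The largest drop (0.33) lands on evaluation point B.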

A grid for seeing the space

The site uses a simple teaching model: imagine possible training sets as rows and possible evaluation slices as columns. Each cell stores the performance you would observe for that train-eval pairing.

Real systems do not literally enumerate this whole grid. The point of the metaphor is to make comparisons visible: which row changed, where the effect landed, and how large the difference was.

Toy world with four observations

Train set   A      B      C      D
AB          0.85   0.72   0.45   0.38
ABC         0.88   0.80   0.75   0.52
ABCD        0.92   0.88   0.85   0.82
ACD         0.78   0.55   0.82   0.80
Rows are possible training sets. Columns are evaluation slices. The interesting thing is usually the difference between nearby cells or rows.
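One way to make the metaphor concrete is to store the toy grid as a nested mapping and compare rows directly. This is just the table above in code, with a small helper for row differences:

```python
# The toy grid: grid[training_set][eval_point] -> performance.
grid = {
    "AB":   {"A": 0.85, "B": 0.72, "C": 0.45, "D": 0.38},
    "ABC":  {"A": 0.88, "B": 0.80, "C": 0.75, "D": 0.52},
    "ABCD": {"A": 0.92, "B": 0.88, "C": 0.85, "D": 0.82},
    "ACD":  {"A": 0.78, "B": 0.55, "C": 0.82, "D": 0.80},
}

def row_diff(grid, row_from, row_to):
    """Per-column change when moving from one training set to another."""
    return {col: round(grid[row_to][col] - grid[row_from][col], 2)
            for col in grid[row_from]}

# Adding C to AB helps column C most; removing B from ABCD hurts column B most.
print(row_diff(grid, "AB", "ABC"))
print(row_diff(grid, "ABCD", "ACD"))
```

Real systems never hold the whole grid; they estimate a handful of these row differences.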

What the idea helps connect

The framing is intentionally broad. It can pull together data valuation, scaling, selection, dataset distillation, poisoning, some forms of privacy analysis, some fairness interventions, and strategic collective action around data.

I am not claiming those fields are formally identical. I am claiming they often become easier to compare once you view them as different ways of changing the training data and comparing the outcome.

That comparison should not make the differences disappear. In some cases the “data change” is a technical intervention inside an optimization pipeline. In others it is a dispute over labor, governance, privacy, or institutional power.

  • Value and attribution

    Leave-one-out, influence functions, TracIn, and Shapley-style methods all ask, in different ways, which points or groups are doing the work.

  • Selection and compression

    Active learning, coresets, curriculum learning, and dataset distillation ask which rows are worth keeping, labeling, or synthesizing.

  • Robustness, privacy, and repair

    Poisoning, privacy interventions, and some fairness-by-data methods all study what happens when the training data are corrupted, hidden, repaired, or reweighted.

  • Strategic collective action

    Data strikes, contribution campaigns, and bargaining interventions change the data-generating process itself. They are not just analytical transformations of a dataset; they are sociotechnical conflicts over power, consent, and leverage.
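For the value-and-attribution family, the Shapley-style idea can be shown exactly on a tiny example: average each point's marginal contribution over every order in which points could arrive. The utility function here is a stand-in (the mean-label model and negative absolute error from earlier); real data valuation swaps in real training and typically approximates the average by sampling.

```python
from itertools import permutations

# Exact Shapley values over a 3-point training set.
# Stand-in utility: negative absolute error of the mean-label model
# on one evaluation label; empty set scores 0 by convention.

train_labels = {0: 1.0, 1: 2.0, 2: 9.0}
eval_label = 2.0

def utility(subset):
    if not subset:
        return 0.0
    vals = [train_labels[i] for i in subset]
    return -abs(sum(vals) / len(vals) - eval_label)

points = list(train_labels)
orders = list(permutations(points))
shapley = {i: 0.0 for i in points}
for order in orders:
    seen = []
    for i in order:
        # Marginal contribution of i given the points already "present".
        shapley[i] += (utility(seen + [i]) - utility(seen)) / len(orders)
        seen.append(i)

print({i: round(v, 3) for i, v in shapley.items()})
```

Two sanity checks fall out: the values sum to the full-set utility minus the empty-set utility (efficiency), and the outlier point gets a negative value, the same verdict leave-one-out gave, but now averaged over all possible subsets rather than read off one comparison.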

How to read the site

The pages are arranged more like an essay with appendices than a landing page. Start wherever matches the question you already have, then move outward.