Related work

A lightweight map of neighboring literatures, with each area curated as a Semble collection and synced into the site at build time.

21 areas · 156 papers · Updated May 28, 2026

This page is generated from a set of curated Semble collections. Each related area here corresponds to a collection where I add, remove, and reorganize papers as the reading map changes.

When the site builds, Astro pulls the latest public Semble data and turns those collections into the grouped shelves you see on this page. The result is intentionally non-exhaustive: each area is a hand-curated starting shelf, not a comprehensive bibliography.
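As a rough illustration of the grouping step described above, here is a minimal TypeScript sketch. It is not the site's actual build code: the `PaperRecord` shape, the `Shelf` type, and the `buildShelves` function are hypothetical names, and it assumes the Semble data has already been fetched and flattened into one record per paper per collection. Sorting each shelf newest first is likewise an assumption, based only on how the shelves on this page happen to be ordered.

```typescript
// Hypothetical record shape; the real Semble payload may look different.
interface PaperRecord {
  title: string;
  authors: string[];
  venue: string;
  year: number;
  collection: string; // name of the curated collection, e.g. "unlearning"
}

// Hypothetical shape of one rendered shelf.
interface Shelf {
  area: string;
  countLabel: string;
  papers: PaperRecord[];
}

// Group flat records into one shelf per collection, newest paper first.
function buildShelves(records: PaperRecord[]): Shelf[] {
  const byArea = new Map<string, PaperRecord[]>();
  for (const record of records) {
    const bucket = byArea.get(record.collection) ?? [];
    bucket.push(record);
    byArea.set(record.collection, bucket);
  }

  return Array.from(byArea.entries()).map(([area, papers]) => {
    // Copy before sorting so the caller's array is left untouched.
    const sorted = [...papers].sort((a, b) => b.year - a.year);
    const noun = sorted.length === 1 ? "paper" : "papers";
    return { area, countLabel: `${sorted.length} ${noun}`, papers: sorted };
  });
}

// Usage with three entries taken from this page.
const sampleRecords: PaperRecord[] = [
  {
    title: "Towards Making Systems Forget with Machine Unlearning",
    authors: ["Yinzhi Cao", "Junfeng Yang"],
    venue: "2015 IEEE Symposium on Security and Privacy",
    year: 2015,
    collection: "unlearning",
  },
  {
    title: "Data Valuation using Reinforcement Learning",
    authors: ["Jinsung Yoon", "Sercan Arik", "Tomas Pfister"],
    venue: "International Conference on Machine Learning",
    year: 2020,
    collection: "reinforcement learning for data valuation",
  },
  {
    title: "Unlearning Traces the Influential Training Data of Language Models",
    authors: ["Masaru Isonuma", "Ivan Titov"],
    venue: "ACL",
    year: 2024,
    collection: "unlearning",
  },
];

for (const shelf of buildShelves(sampleRecords)) {
  console.log(`${shelf.area}: ${shelf.countLabel}`);
}
// unlearning: 2 papers
// reinforcement learning for data valuation: 1 paper
```

Shelves come out in the order their collections first appear in the input, because a `Map` preserves insertion order. A paper that belongs to several collections simply appears once per collection, which is why some titles show up on more than one shelf.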

Core areas

collective action

5 papers

Promoting User Data Autonomy During the Dissolution of a Monopolistic Firm

Rushabh Solanki, Elliot Creager / 2nd Workshop on Regulatable ML at NeurIPS 2024 (2024)
Algorithmic Collective Action in Machine Learning

Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic / International Conference on Machine Learning (2023)
Can "Conscious Data Contribution" Help Users to Exert "Data Leverage" Against Technology Companies?

Nicholas Vincent, Brent Hecht / Proceedings of the ACM on Human-Computer Interaction (2021)
Data Leverage: A Framework for Empowering the Public in its Relationship with Technology Companies

Nicholas Vincent, Hanlin Li, Nicole Tilly, Stevie Chancellor, Brent Hecht / FAccT (2021)
“Data Strikes”: Evaluating the Effectiveness of a New Form of Collective Action Against Technology Companies

Nicholas Vincent, Brent Hecht, Shilad Sen / The World Wide Web Conference (2019)

data dividends

7 papers

Sharing the Algorithm: The Tax Solution to Generative AI

Jeremy Bearer-Friend, Sarah Polcz / Columbia Journal of Tax Law (2025)
Sharing the Winnings of AI with Data Dividends: Challenges with "Meritocratic" Data Valuation

Nicholas Vincent, Brent Hecht / EAAMO '23 (2023)
A Data Dividend That Works: Steps Toward Building an Equitable Data Economy

Yakov Feygin, Brent Hecht, Matthew Prewitt, Hanlin Li, Nicholas Vincent, Chirag Lala, Luisa Scarcella / Berggruen Institute white paper (2021)
Nonrivalry and the Economics of Data

Charles I. Jones, Christopher Tonetti / American Economic Review (2020)
Economic Impact and Feasibility of Data Dividends

Tarun Wadhwa / Data Catalyst report (2020)
Mapping the Potential and Pitfalls of "Data Dividends" as a Means of Sharing the Profits of Artificial Intelligence

Nicholas Vincent, Yichun Li, Renee Zha, Brent Hecht / arXiv (2019)
Should We Treat Data as Labor? Moving Beyond “Free”

Imanol Arrieta-Ibarra, Leonard Goff, Diego Jiménez-Hernández, Jaron Lanier, E. Glen Weyl / AEA Papers and Proceedings (2018)

data provenance and source attribution

5 papers

WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data

Xinyang Lu, Jingtan Wang, Zitong Zhao, Zhongxiang Dai, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low / Findings of the Association for Computational Linguistics: ACL 2025 (2025)
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah Smith, Luke Zettlemoyer / International Conference on Learning Representations (2024)
A large-scale audit of dataset licensing and attribution in AI

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker / Nature Machine Intelligence (2024)
Datasheets for Datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford / Communications of the ACM (2021)
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner / Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021)

influence

19 papers

Bayesian Influence Functions for Hessian-Free Data Attribution

Philipp Alexander Kreer, Wilson Wu, Maxwell Adam, Zach Furman, Jesse Hoogland / ICLR (2026)
Rescaled Influence Functions: Accurate Data Attribution in High Dimension

Ittai Rubinstein, Samuel B. Hopkins / The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Taming Hyperparameter Sensitivity in Data Attribution: Practical Selection Without Costly Retraining

Weiyi Wang, Junwei Deng, Yuzheng Hu, Shiyuan Zhang, Xirui Jiang, Runting Zhang, Han Zhao, Jiaqi W. Ma / The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Distributional Training Data Attribution: What do Influence Functions Sample?

Bruno Kacper Mlodozeniec, Isaac Reid, Samuel Power, David Krueger, Murat A Erdogdu, Richard E. Turner, Roger Baker Grosse / The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
Better Training Data Attribution via Better Inverse Hessian-Vector Products

Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila A. McIlraith, Roger Baker Grosse / NeurIPS (2025)
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution

Ian Covert, Tatsunori Hashimoto, Chanwoo Kim, Su-In Lee, James Zou / Advances in Neural Information Processing Systems 37 (2024)
Training Data Attribution via Approximate Unrolling

Juhan Bae, Roger Grosse, Wu Lin, Jonathan Lorraine / Advances in Neural Information Processing Systems 37 (2024)
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Yongchan Kwon, Eric Wu, Kevin Wu, James Y Zou / International Conference on Learning Representations (2024)
A Bayesian Approach To Analysing Training Data Attribution In Deep Learning

Elisa Nguyen, Minjoon Seo, Seong Joon Oh / Advances in Neural Information Processing Systems (2023)
TRAK: Attributing Model Behavior at Scale

Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry / International Conference on Machine Learning (2023)
If Influence Functions are the Answer, Then What is the Question?

Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, Roger B. Grosse / Advances in Neural Information Processing Systems (2022)
Datamodels: Understanding Predictions with Data and Data with Predictions

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry / International Conference on Machine Learning (2022)
Revisiting Methods for Finding Influential Examples

Karthikeyan K, Anders Søgaard / arXiv (2021)
Influence Functions in Deep Learning Are Fragile

Samyadeep Basu, Phil Pope, Soheil Feizi / International Conference on Learning Representations (2021)
On Second-Order Group Influence Functions for Black-Box Predictions

Samyadeep Basu, Xuchen You, Soheil Feizi / International Conference on Machine Learning (2020)
Estimating Training Data Influence by Tracing Gradient Descent

Garima Pruthi, Frederick Liu, Satyen Kale, Mukund Sundararajan / Advances in Neural Information Processing Systems (2020)
On the Accuracy of Influence Functions for Measuring Group Effects

Pang Wei Koh, Kai-Siang Ang, Hubert Teo, Percy S. Liang / Advances in Neural Information Processing Systems (2019)
Representer Point Selection for Explaining Deep Neural Networks

Chih-Kuan Yeh, Joon Kim, Ian En-Hsu Yen, Pradeep K. Ravikumar / Advances in Neural Information Processing Systems (2018)
Understanding Black-box Predictions via Influence Functions

Pang Wei Koh, Percy Liang / International Conference on Machine Learning (2017)

scaling laws

6 papers

Beyond neural scaling laws: beating power law scaling via data pruning

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos / arXiv (2022)
Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre / arXiv.org (2022)
Deep Double Descent: Where Bigger Models and More Data Hurt

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever / Journal of Statistical Mechanics: Theory and Experiment (2021)
Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish / arXiv.org (2020)
Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei / arXiv.org (2020)
Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou / arXiv.org (2017)

selection and coresets

4 papers

Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia / International Conference on Machine Learning (2024)
GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer / Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec / International Conference on Machine Learning (2020)
Active Learning for Convolutional Neural Networks: A Core-Set Approach

Ozan Sener, Silvio Savarese / International Conference on Learning Representations (2018)

poisoning

5 papers

Rethinking Backdoor Attacks

Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry / ICML (2023)
BadNets: Evaluating Backdooring Attacks on Deep Neural Networks

Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg / IEEE Access (2019)
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, Dawn Song / arXiv (2017)
Certified Defenses for Data Poisoning Attacks

Jacob Steinhardt, Pang Wei Koh, Percy S. Liang / Advances in Neural Information Processing Systems (2017)
Poisoning Attacks against Support Vector Machines

Battista Biggio, Blaine Nelson, Pavel Laskov / Proceedings of the 29th International Conference on Machine Learning (2012)

semivalues

26 papers

On the Impact of the Utility in Semivalue-based Data Valuation

Mélissa Tamine, Benjamin Heymann, Maxime Vono, Patrick Loiseau / ICLR (2026)
Semivalue-based data valuation is arbitrary and gameable

Hannah Diehl, Ashia C. Wilson / arXiv (2025)
Data Shapley in One Training Run

Jiachen (Tianhao) Wang, Prateek Mittal, Dawn Song, Ruoxi Jia / International Conference on Learning Representations (2025)
SAVA: Scalable Learning-Agnostic Data Valuation

Samuel Kessler, Tam Le, Vu Nguyen / ICLR (2025)
An Instrumental Value for Data Production and its Application to Data Pricing

Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Haifeng Xu / International Conference on Machine Learning (2025)
DeRDaVa: Deletion-Robust Data Valuation for Machine Learning

Xiao Tian, Rachael Hwee Ling Sim, Jue Fan, Bryan Kian Hsiang Low / AAAI (2024)
Data Distribution Valuation

Giulia Fanti, Chuan-Sheng Foo, Bryan Low, Shuaiqi Wang, Xinyi Xu / Advances in Neural Information Processing Systems 37 (2024)
Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia / International Conference on Machine Learning (2024)
Distributionally Robust Data Valuation

Xiaoqiang Lin, Xinyi Xu, Zhaoxuan Wu, See-Kiong Ng, Bryan Kian Hsiang Low / International Conference on Machine Learning (2024)
Data Valuation in the Absence of a Reliable Validation Set

Himanshu Jahagirdar, Jiachen T. Wang, Ruoxi Jia / Transactions on Machine Learning Research (2024)
Robust Data Valuation with Weighted Banzhaf Values

Weida Li, Yaoliang Yu / Advances in Neural Information Processing Systems (2023)
Data Valuation Without Training of a Model

Nohyun Ki, Hoyong Choi, Hye Won Chung / ICLR (2023)
2D-Shapley: A Framework for Fragmented Data Valuation

Zhihong Liu, Hoang Anh Just, Xiangyu Chang, Xi Chen, Ruoxi Jia / International Conference on Machine Learning (2023)
OpenDataVal: a Unified Benchmark for Data Valuation

Kevin Jiang, Weixin Liang, James Zou, Yongchan Kwon / Advances in Neural Information Processing Systems (2023)
Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

Yongchan Kwon, James Zou / International Conference on Machine Learning (2023)
LAVA: Data Valuation without Pre-Specified Learning Algorithms

Hoang Anh Just, Feiyang Kang, Jiachen T. Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia / ICLR (2023)
Data Banzhaf: A Robust Data Valuation Framework for Machine Learning

Jiachen T. Wang, Ruoxi Jia / International Conference on Artificial Intelligence and Statistics (2023)
Data Appraisal Without Data Sharing

Xinlei Xu, Awni Hannun, Laurens Van Der Maaten / International Conference on Artificial Intelligence and Statistics (2022)
DAVINZ: Data Valuation using Deep Neural Networks at Initialization

Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low / International Conference on Machine Learning (2022)
Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning

Yongchan Kwon, James Zou / International Conference on Artificial Intelligence and Statistics (2022)
Validation Free and Replication Robust Volume-based Data Valuation

Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, Bryan Kian Hsiang Low / Advances in Neural Information Processing Systems (2021)
A Distributional Framework For Data Valuation

Amirata Ghorbani, Michael Kim, James Zou / International Conference on Machine Learning (2020)
Towards Efficient Data Valuation Based on the Shapley Value

Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, Costas J. Spanos / The 22nd International Conference on Artificial Intelligence and Statistics (2019)
Efficient task-specific data valuation for nearest neighbor algorithms

Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, Dawn Song / Proceedings of the VLDB Endowment (2019)
Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani, James Zou / International Conference on Machine Learning (2019)
A Value for n-Person Games

L. S. Shapley / Contributions to the Theory of Games II (1953)

user-generated content

9 papers

WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data

Xinyang Lu, Jingtan Wang, Zitong Zhao, Zhongxiang Dai, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low / Findings of the Association for Computational Linguistics: ACL 2025 (2025)
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah Smith, Luke Zettlemoyer / International Conference on Learning Representations (2024)
A large-scale audit of dataset licensing and attribution in AI

Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker / Nature Machine Intelligence (2024)
LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev / NeurIPS 2022 Datasets and Benchmarks (2022)
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner / Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu / Journal of Machine Learning Research (2020)
Wiki-40B: Multilingual Language Model Dataset

Mandy Guo, Zihang Dai, Denny Vrandečić, Rami Al-Rfou’ / Proceedings of the Twelfth Language Resources and Evaluation Conference (2020)
Measuring the Importance of User-Generated Content to Search Engines

Nicholas Vincent, Isaac Johnson, Patrick Sheehan, Brent Hecht / Proceedings of the International AAAI Conference on Web and Social Media (2019)
DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer, Milan Dojchinovski, Sebastian Hellmann / Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (2016)

Related areas

active learning

6 papers

Data Acquisition via Experimental Design for Data Markets

Baihe Huang, Michael Jordan, Sai Praneeth Karimireddy, Charles Lu, Ramesh Raskar, Praneeth Vepakomma / Advances in Neural Information Processing Systems 37 (2024)
Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, Alekh Agarwal / International Conference on Learning Representations (2020)
Active Learning Literature Survey

Burr Settles / University of Wisconsin-Madison, Computer Sciences Technical Report 1648 (2009)
Active Learning with Statistical Models

D. A. Cohn, Z. Ghahramani, M. I. Jordan / Journal of Artificial Intelligence Research (1996)
A Sequential Algorithm for Training Text Classifiers

David D. Lewis, William A. Gale / SIGIR ’94 (1994)
Query by committee

H. S. Seung, M. Opper, H. Sompolinsky / Proceedings of the fifth annual workshop on Computational learning theory (1992)

augmentation and curriculum

5 papers

Randaugment: Practical automated data augmentation with a reduced search space

Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, Quoc V. Le / 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2020)
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo / Proceedings of the IEEE/CVF International Conference on Computer Vision (2019)
AutoAugment: Learning Augmentation Strategies From Data

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le / Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
mixup: Beyond Empirical Risk Minimization

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz / International Conference on Learning Representations (2018)
Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston / Proceedings of the 26th Annual International Conference on Machine Learning (2009)

causality

5 papers

Collaborative Causal Inference with Fair Incentives

Rui Qiao, Xinyi Xu, Bryan Kian Hsiang Low / International Conference on Machine Learning (2023)
Causality

Judea Pearl / Cambridge University Press (2009)
Reducing Bias in Observational Studies Using Subclassification on the Propensity Score

Paul R. Rosenbaum, Donald B. Rubin / Journal of the American Statistical Association (1984)
The Central Role of the Propensity Score in Observational Studies for Causal Effects

Paul R. Rosenbaum, Donald B. Rubin / Biometrika (1983)
Estimating causal effects of treatments in randomized and nonrandomized studies.

Donald B. Rubin / Journal of Educational Psychology (1974)

data minimization

19 papers

SoK: Data Minimization in Machine Learning

Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski / SaTML (2026)
Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams

Ted Shaowang, Shinan Liu, Jonatas Marques, Nick Feamster, Sanjay Krishnan / PVLDB (2025)
From Principle to Practice: Vertical Data Minimization for Machine Learning

Robin Staab, Nikola Jovanović, Mislav Balunović, Martin Vechev / IEEE Symposium on Security and Privacy (2024)
The Data Minimization Principle in Machine Learning

Prakhar Ganesh, Cuong Tran, Reza Shokri, Ferdinando Fioretto / Regulatable ML Workshop at NeurIPS (2024)
Privacy-Preserving Quantile Treatment Effect Estimation for Randomized Controlled Trials

Leon Yao, Paul Yiming Li, Jiannan Lu / Conference on Digital Experimentation (2024)
Data Minimization at Inference Time

Cuong Tran, Ferdinando Fioretto / NeurIPS (2023)
The Interplay Between Machine Learning and Data Minimization Under the GDPR: The Case of Google's Topics API

Cornelius Witt, Jan De Bruyne / International Data Privacy Law (2023)
Configurable Per-Query Data Minimization for Privacy-Compliant Web APIs

Frank Pallas, David Hartmann, Paul Heinrich, Josefine Kipke, Elias Grünewald / International Conference on Web Engineering (2022)
Learning to Limit Data Collection via Scaling Laws: A Computational Interpretation for the Legal Principle of Data Minimization

Divya Shanmugam, Fernando Diaz, Samira Shabanian, Michele Finck, Asia J. Biega / FAccT (2022)
Data Minimization for GDPR Compliance in Machine Learning Models

Abigail Goldsteen, Gilad Ezov, Ron Shmelkin, Micha Moffie, Ariel Farkash / AI and Ethics (2022)
Randomized Controlled Trials without Data Retention

Winston Chou / Conference on Digital Experimentation (2021)
Operationalizing the Legal Principle of Data Minimization for Personalization

Asia J. Biega, Peter Potash, Hal Daumé III, Fernando Diaz, Michèle Finck / SIGIR (2020)
A Data Minimization Model for Embedding Privacy into Software Systems

Awanthika Senarath, Nalin Asanka Gamagedara Arachchilage / Computers & Security (2019)
Monitoring Data Minimisation

Srinivas Pinisetty, Thibaud Antignac, David Sands, Gerardo Schneider / arXiv (2018)
Data Minimisation: A Language-Based Approach

Thibaud Antignac, David Sands, Gerardo Schneider / IFIP SEC (2017)
Data Minimisation in Communication Protocols: A Formal Analysis Framework and Application to Identity Management

Meilof Veeningen, Benne de Weger, Nicola Zannone / International Journal of Information Security (2014)
Privacy Architectures: Reasoning About Data Minimisation and Integrity

Thibaud Antignac, Daniel Le Métayer / Security and Trust Management (2014)
Privacy by Design: A Formal Framework for the Analysis of Architectural Choices

Daniel Le Métayer / CODASPY (2013)
Privacy by Design

Peter Schaar / Identity in the Information Society (2010)

distillation

4 papers

Dataset Condensation With Distribution Matching

Bo Zhao, Hakan Bilen / WACV (2023)
Dataset Condensation with Gradient Matching

Bo Zhao, Konda Reddy Mopuri, Hakan Bilen / ICLR (2021)
Dataset Distillation

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, Alexei A. Efros / International Conference on Learning Representations (2018)
Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, Jeffrey Dean / NeurIPS Deep Learning and Representation Learning Workshop (2015)

fairness via data interventions

7 papers

Characterizing Fairness Over the Set of Good Models Under Selective Labels

Amanda Coston, Ashesh Rambachan, Alexandra Chouldechova / International Conference on Machine Learning (2021)
Datasheets for Datasets

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford / Communications of the ACM (2021)
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification

Joy Buolamwini, Timnit Gebru / Conference on Fairness, Accountability and Transparency (2018)
Optimized Pre-Processing for Discrimination Prevention

Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, Kush R. Varshney / Advances in Neural Information Processing Systems (2017)
Certifying and Removing Disparate Impact

Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian / Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015)
Learning Fair Representations

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, Cynthia Dwork / International Conference on Machine Learning (2013)
Data preprocessing techniques for classification without discrimination

Faisal Kamiran, Toon Calders / Knowledge and Information Systems (2012)

membership inference

6 papers

Detecting Non-Membership in LLM Training Data via Rank Correlations

Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu / EACL (2026)
Membership Inference Attacks From First Principles

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramèr / 2022 IEEE Symposium on Security and Privacy (SP) (2022)
Extracting Training Data from Large Language Models

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel / 30th USENIX Security Symposium (USENIX Security 21) (2021)
Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting

Samuel Yeom, Irene Giacomelli, Matt Fredrikson, Somesh Jha / arXiv.org (2018)
Membership Inference Attacks Against Machine Learning Models

Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov / 2017 IEEE Symposium on Security and Privacy (SP) (2017)
Differential Privacy

Cynthia Dwork / Lecture Notes in Computer Science (2006)

meta-learning

3 papers

DataRater: Meta-Learned Dataset Curation

Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver / NeurIPS 2025 (2025)
Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting

Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng / Advances in Neural Information Processing Systems (2019)
Learning to Reweight Examples for Robust Deep Learning

Mengye Ren, Wenyuan Zeng, Bin Yang, Raquel Urtasun / International Conference on Machine Learning (2018)

model collapse

3 papers

AI models collapse when trained on recursively generated data

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal / Nature (2024)
The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson / arXiv (2023)
Self-Consuming Generative Models Go MAD

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, Richard G. Baraniuk / arXiv.org (2023)

reinforcement learning for data valuation

1 paper

Data Valuation using Reinforcement Learning

Jinsung Yoon, Sercan Arik, Tomas Pfister / International Conference on Machine Learning (2020)

training dynamics

4 papers

Data Valuation Without Training of a Model

Nohyun Ki, Hoyong Choi, Hye Won Chung / ICLR (2023)
Deep Learning on a Data Diet: Finding Important Examples Early in Training

Mansheej Paul, Surya Ganguli, Gintare Karolina Dziugaite / Advances in Neural Information Processing Systems (2021)
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi / Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)
An Empirical Study of Example Forgetting during Deep Neural Network Learning

Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon / ICLR (2019)

unlearning

7 papers

Unlearning Traces the Influential Training Data of Language Models

Masaru Isonuma, Ivan Titov / Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024)
Descent-to-Delete: Gradient-Based Methods for Machine Unlearning

Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi / Algorithmic Learning Theory (2021)
Machine Unlearning

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot / IEEE Symposium on Security and Privacy (S&P) (2021)
Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks

Aditya Golatkar, Alessandro Achille, Stefano Soatto / 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Certified Data Removal from Machine Learning Models

Chuan Guo, Tom Goldstein, Awni Hannun, Laurens Van Der Maaten / International Conference on Machine Learning (2020)
Making AI Forget You: Data Deletion in Machine Learning

Antonio Ginart, Melody Guan, Gregory Valiant, James Zou / Advances in Neural Information Processing Systems (2019)
Towards Making Systems Forget with Machine Unlearning

Yinzhi Cao, Junfeng Yang / 2015 IEEE Symposium on Security and Privacy (2015)