Michael Saxon

Siegel Postdoctoral Fellow
University of Washington

I study generative AI artifacts like LLMs and text-to-image models. I make meaningful evaluations of new capabilities that are difficult to measure to improve them.

I am a Siegel Family Endowment Postdoctoral Fellow in the Tech Policy Lab and University of Washington iSchool. I got my Ph.D. in Computer Science from the University of California, Santa Barbara, advised by Prof. William Yang Wang. I was an NSF Graduate Research Fellow, a Center for Responsible ML Fellow, and a 2024 Rising Star in Generative AI. I got my M.S. and B.S. in Electrical & Computer Engineering at Arizona State University.

About

Attention! We will be presenting a tutorial entitled “The Science of Benchmarking: What's Measured, What's Missing, What's Next” at NeurIPS on December 2, 2025! Hope to see you there!

I have broad interests in generative AI, NLP, and multimodal systems. In particular, I'm interested in:

Rigorous evaluation of difficult-to-measure capabilities in language models and generative image systems. (COLM 2024^{In our work “Benchmarks as Microscopes: A call for Model Metrology” we lay out an agenda for building a science of evaluation.}, NeurIPS 2024 Spotlight^{We introduced T2IScoreScore, a meta-evaluation of text-to-image faithfulness metrics that uses “semantic error graphs” to evaluate how well metrics can rank closely related images with objective error counts.})
Building multilingual and culturally competent generative AI systems, addressing performance disparities, bias, and unique knowledge possession. (ACL 2023^{My CoCo-CroLa evaluation of “multilingual conceptual coverage” in text-to-image models was the first assessment of the multilingual capabilities of T2I systems. It is produced through a fully automated benchmark generation pipeline.}, FAccT 2023 Oral, NAACL 2024)
Advancing multimodal (primarily vision & language) generative AI systems, in particular with respect to deep semantic understanding. (EMNLP 2024^{We introduced long-context visual extractive reasoning, a more ecologically valid measure of long-context understanding in vision-language models than prior “visual needle in a haystack” tests, and demonstrated that long-context VLMs are much worse at the task than previously believed.}, Tech Crunch coverage)

I have had the wonderful opportunity to mentor over a dozen students in research over the course of my PhD.

Experience

Academic

University of Washington

Postdoctoral Scholar, Tech Policy Lab (2025–)

Advised by Aylin Caliskan

University of California, Santa Barbara

PhD student, NLP Lab, Computer Science (2020–2025)

Advised by William Yang Wang
Recipient, NSF Graduate Research Fellowship (2020)

Arizona State University

MS Computer Engineering (2018–2020)

Advised by Visar Berisha & Sethuraman Panchanathan

Arizona State University

BSE Electrical Engineering (2014–2018)

Barrett, the Honors College, Presidential Scholar (Full tuition)

Industry

AMD Research

Research Intern, Open GenAI (2024)

Working on open source generative AI initiatives

Meta AI

Research Intern, Cognitive AI/Conversational AI Research (2022)

Developed novel evaluation methods for large language models

Amazon Alexa AI

Applied Scientist Intern (2019–2021)

Web-based QA (2021)
Hybrid Science (2019, 2020)

News

11/18/2025 We will be presenting a tutorial entitled “The Science of Benchmarking: What's Measured, What's Missing, What's Next” at NeurIPS on December 2, 2025!

9/15/2025 Started my postdoc at UW.

6/2/2025 Defended my thesis!

5/8/2025 Interviewed in IEEE Spectrum about reasoning models; our ThoughtTerminator work was mentioned! [article link]

4/17/2025 New paper! ThoughtTerminator: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models [arXiv]

3/29/2025 I will be joining the University of Washington for a postdoc position in September!

12/13/2024 Presented our work on meta-evaluation of text-to-image metrics at NeurIPS! [paper]

11/13/2024 Presented our work on long-context capabilities in VLMs at EMNLP! [paper]

10/7/2024 Presented our work on Model Metrology at COLM 2024! [paper]

9/25/2024 Recognized as a Rising Star in Generative AI!

8/1/2024 Invited talk on multilinguality in text-to-image models to the SALT group at Stanford

6/29/2024 Interviewed in TechCrunch on the problems with claiming long-context capabilities of VLMs, based on our LoCoVQA preprint! [link]

6/22/2024 Gave an Oral presentation at NAACL 2024 on our work on translation errors in multilingual benchmarking. See the recording! [YouTube]

5/4/2024 Had a great week visiting UMBC, Georgetown, UMD, and Johns Hopkins to present my work on rigorous measurement for text-to-image models. Check out the talk recording! [YouTube]

4/8/2024 We were suprised to find that fancy VLM-based text-to-image faithfulness metrics don't actually outperform simple correlation-based ones using our new meta-evaluation T2IScoreScore! Check out our interactive leaderboard!

3/18/2024 Excited about our work diving deeper on characterizing failure cases in translation for text-to-image model testing!

12/10/2023 Wanrong presented our (mostly Xinyi's) work analyzing how in-context learning works for LLMs at NeurIPS 2023!

11/16/2023 Gave an invited talk on my recent work on assessing and improving multilingual knowledge and capabilities in T2I models at USC ISI!

11/3/2023 Our dataset/paper proposing the task of “video infilling and prediction” for assessing reasoning capabilities of VLMs has been accepted to EMNLP!

8/6/2023 Our survey paper on self-correcting and automated correction methods for LLM workflows, “Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies” is up on arXiv!

7/13/2023 Presented CoCo-CroLa in the paper “Multilingual Conceptual Coverage in Text-to-Image Models” at ACL 2023!

6/20/2023 Gave a talk at FAccT 2023 on CoCo-CroLa and our initial findings of interesting cross-lingual biases. Watch the talk!

5/8/2023 Presenting at ICLR 2023 was a blast!

3/9/2023 Check out me and Alex's position paper, “Users are the North Star for AI Transparency,” written in collaboration with our advisor William, and profs Shiyu Chang and Zack Lipton! You'll probably like it more than the IJCAI reviewers did 😉 Preprint: [arXiv:2303.05500]

1/23/2023 2 of my papers were accepted to ICLR 2023 and one to EACL 2023! In particular I'm happy to share that WikiWhy, a new benchmark for analyzing reasoning in LMs using QA got accepted as an oral presentation! Super proud of my undergrad group to get such an honor at ICLR for their first paper! Preprint here: [arXiv:2210.12152]

12/20/2022 Check out my preprint “Multilingual Conceptual Coverage in Text-to-Image Models” on OpenReview! We quantify the degree to which T2I models including DALL-E and StableDiffusion contain representations of ~200 tangible concepts across EN, ES, DE, ZH, JA, HE, and ID. Preprint here: [OpenReview:5H2m3tCEaQ] Demo available here: [demo link]

11/18/2022 The 2022 Southern California NLP Symposium (SoCalNLP22) was a massive success! Co-chairing the program committee and participating in event organization was a great privilege and it was wonderful meeting everybody. Please check out our full event livestream [YouTube Link] and some event photos [Twitter:@m2saxon] [Twitter:@ucsbNLP]!

10/24/2022 Our general-purpose text-reference comparison metric that simulates human preferences for translation and summarization, SEScore, is now available on HuggingFace spaces! Preprint here: [arXiv:2210.05035]

10/12/2022 Our work “Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis” will appear in Findings of EMNLP 2022! Preprint here: [arXiv:2210.05035]

10/12/2022 Check out the latest preprint of my work “PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers” on arXiv! We demonstrated automated detection of spurious, annotator-driven correlations that lead to cheating features in NLI. Preprint here: [arXiv:2112.09237]

6/6/2022 Excited to start my 2022 AI Research Scientist Internship at Meta in Menlo Park!

12/3/2021 Our work “Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer” will appear at AAAI 2022! Preprint here: [arXiv:2110.02950]

11/8/2021 Had a great time presenting our Disclosive Transparency work at EMNLP 2021! Our work was even highlighted in an EMNLP overview article! Oral presentation prerecording: [YouTube]

10/1/2021 Our work “Counterfactual Maximum Likelihood Estimation for Training Deep Networks” will appear at NeurIPS 2021! Preprint here: [arXiv:2106.03831]

9/23/2021 Our work “Modeling Disclosive Transparency in NLP Application Descriptions” will appear at EMNLP 2021 as an oral presentation! Preprint here: [arXiv:2101.00433]

9/13/2021 I was profiled on the Amazon Science Blog about my experience doing multiple applied science internships with the company! Article here: [amazon.science]