Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.
@inproceedings{saxon2024evaluates,
author = "Saxon, Michael and Jahara, Fatima and Khoshnoodi, Mahsa and Lu, Yujie and Sharma, Aditya and Wang, William Yang",
title = "Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)",
booktitle = "The Thirty-eighth Annual Conference on Neural Information Processing Systems",
year = "2024",
abstract = "With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.",
url = {https://openreview.net/forum?id=S4YRCLbUK1},
}
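As a minimal illustration of the ordering meta-metric described in the abstract above (a sketch under assumed inputs, not the TS2 implementation): a faithful metric should rank images in the same order as their objective error counts, which a Spearman rank correlation captures. The names error_counts and metric_scores are illustrative.

from scipy.stats import spearmanr

def ordering_score(error_counts, metric_scores):
    # A faithful metric should assign LOWER scores to images with MORE
    # errors, so we expect a strong negative rank correlation.
    rho, _ = spearmanr(error_counts, metric_scores)
    return -rho  # higher is better: 1.0 is a perfect expected ordering

# Example: five images along one walk of a semantic error graph.
print(ordering_score([0, 1, 2, 3, 4], [0.95, 0.80, 0.71, 0.55, 0.40]))  # 1.0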
Abstract: We present LoCoVQA, a dynamic benchmark generator for evaluating long-context reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries—a task that is quite easy for language models (LMs) in the text domain—demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
@inproceedings{sharma2024losing,
author = "Sharma*, Aditya and Saxon*, Michael and Wang, William Yang",
title = "Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
pages = "5429--5451",
year = "2024",
abstract = "We present LoCoVQA, a dynamic benchmark generator for evaluating long-context reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images.Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries—a task that is quite easy for language models (LMs) in the text domain—demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.",
url = {https://aclanthology.org/2024.findings-emnlp.312/},
}
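To make the "logarithmic decay trend" concrete, here is a sketch of fitting accuracy ≈ a + b·ln(n) to per-context-length scores; the accuracy numbers below are made up for illustration, not results from the paper.

import numpy as np

context_lengths = np.array([1, 2, 4, 8, 16, 32])  # composited images per input
accuracies = np.array([0.90, 0.78, 0.69, 0.57, 0.48, 0.36])  # illustrative only

# For deg=1, np.polyfit on ln(n) returns [slope, intercept].
b, a = np.polyfit(np.log(context_lengths), accuracies, deg=1)
print(f"accuracy ~= {a:.2f} + {b:.2f} * ln(n)")  # b < 0 indicates decay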
Abstract: Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and add clarity to the AI discussion.
@inproceedings{saxon2024benchmarks,
author = "Saxon, Michael and Holtzman, Ari and West, Peter and Wang, William Yang and Saphra, Naomi",
title = "Benchmarks as Microscopes: A Call for Model Metrology",
booktitle = "First Conference on Language Modeling (COLM 2024)",
year = "2024",
abstract = "Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs to and add clarity to the AI discussion.",
url = {https://openreview.net/forum?id=bttKwCZDkm},
}
Abstract: Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.
@article{feng2024tc,
author = "Feng, Weixi and Li, Jiachen and Saxon, Michael and Fu, Tsu-jui and Chen, Wenhu and Wang, William Yang",
title = "Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation",
journal = "arXiv preprint arXiv:2406.08656",
year = "2024",
abstract = "Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20\% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.",
url = {https://arxiv.org/abs/2406.08656},
}
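The abstract mentions new transition-completeness metrics without detailing them here, so the following is only a rough sketch of the underlying idea (both endpoint states should be realized), not the paper's actual metric. The similarity argument stands in for any image-text scorer such as CLIP; all names are illustrative assumptions.

def transition_completion(frames, initial_desc, final_desc, similarity, k=4):
    # Early frames should match the prompt's initial state and late frames
    # its final state; averaging rewards only videos that realize BOTH.
    start = sum(similarity(f, initial_desc) for f in frames[:k]) / k
    end = sum(similarity(f, final_desc) for f in frames[-k:]) / k
    return 0.5 * (start + end)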
Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.
@inproceedings{saxon2024lost,
author = "Saxon, Michael and Luo, Yiran and Levy, Sharon and Baral, Chitta and Yang, Yezhou and Wang, William Yang",
title = "Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts",
booktitle = "NAACL 2024",
pages = "572--582",
year = "2024",
abstract = {Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.},
url = {https://aclanthology.org/2024.naacl-short.48/},
}
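As a sketch of the text-domain predictor mentioned in the abstract (the embedding model choice is an assumption, not the paper's setup): the further a corrected translation drifts from the erroneous one in a multilingual embedding space, the larger the expected change in the generated image population.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

def correction_drift(old_translation: str, new_translation: str) -> float:
    # 0.0 means the correction changes nothing; larger values predict a
    # bigger shift in the image-domain benchmark results.
    old_vec, new_vec = model.encode([old_translation, new_translation])
    return 1.0 - util.cos_sim(old_vec, new_vec).item()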
Abstract: While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
@article{pan2024automatically,
author = "Pan, Liangming and Saxon, Michael and Xu, Wenda and Nathani, Deepak and Wang, Xinyi and Wang, William Yang",
title = "Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies",
journal = "Transactions of the Association for Computational Linguistics",
volume = "12",
pages = "484--506",
year = "2024",
abstract = "While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.",
url = {https://aclanthology.org/2024.tacl-1.27/},
}
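A minimal post-hoc correction loop of the kind the survey categorizes, sketched with hypothetical llm and critic callables (the critic could be the LLM itself for self-correction, or any external feedback system):

def correct_with_feedback(llm, critic, prompt, max_rounds=3):
    draft = llm(prompt)
    for _ in range(max_rounds):
        feedback = critic(draft)  # automated feedback, no human in the loop
        if feedback == "OK":
            break
        draft = llm(f"{prompt}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\nRevise:")
    return draft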
@article{wu2024vsp,
author = "Wu, Qiucheng and Zhao, Handong and Saxon, Michael and Bui, Trung and Wang, William Yang and Zhang, Yang and Chang, Shiyu",
title = "Vsp: Assessing the dual challenges of perception and reasoning in spatial planning tasks for vlms",
journal = "arXiv preprint arXiv:2407.01863",
year = "2024",
url = {https://arxiv.org/abs/2407.01863},
}
@inproceedings{himakunthala2023let,
author = "Himakunthala, Vaishnavi and Ouyang, Andy and Rose, Daniel and He, Ryan and Mei, Alex and Lu, Yujie and Sonar, Chinmay and Saxon, Michael and Wang, William Yang",
title = "Let's think frame by frame with vip: A video infilling and prediction dataset for evaluating video chain-of-thought",
booktitle = "Conference on Empirical Methods in Natural Language Processing",
year = "2023",
url = {https://arxiv.org/abs/2305.13903},
}
Abstract: In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.
@article{wang2023large,
author = "Wang, Xinyi and Zhu, Wanrong and Saxon, Michael and Steyvers, Mark and Wang, William Yang",
title = "Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning",
journal = "Advances in Neural Information Processing Systems",
volume = "36",
pages = "15614--15638",
year = "2023",
abstract = "In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.",
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/3255a7554605a88800f4e120b3a929e1-Abstract-Conference.html},
}
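A simplified sketch of the selection step described in the abstract, assuming log_likelihood is a hypothetical helper returning the small LM's summed log-probability of a target given a prompt (this is not the paper's exact algorithm): score each candidate demonstration by how well it helps the small LM predict held-out labels, keep the top k, and reuse those demonstrations verbatim with larger models.

def select_demonstrations(candidates, dev_set, log_likelihood, k=4):
    def score(demo):
        # Sum over held-out (input, label) pairs with the demo as prefix.
        return sum(
            log_likelihood(prompt=f"{demo}\n{x}", target=y) for x, y in dev_set
        )
    return sorted(candidates, key=score, reverse=True)[:k]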
Abstract: We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language. This technique allows us to estimate how well-suited a model is to a target language as well as identify model-specific weaknesses, spurious correlations, and biases without a-priori assumptions. We demonstrate how it can be used to benchmark T2I models in terms of multilinguality, and how despite its simplicity it is a good proxy for impressive generalization.
@inproceedings{saxon2023multilingual,
author = "Saxon, Michael and Wang, William Yang",
title = "Multilingual Conceptual Coverage in Text-to-Image Models",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
pages = "4831--4848",
year = "2023",
abstract = {We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language. This technique allows us to estimate how well-suited a model is to a target language as well as identify model-specific weaknesses, spurious correlations, and biases without a-priori assumptions. We demonstrate how it can be used to benchmark T2I models in terms of multilinguality, and how despite its simplicity it is a good proxy for impressive generalization.},
url = {https://aclanthology.org/2023.acl-long.266/},
}
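The population comparison in the abstract can be sketched as follows; this is an illustrative simplification that reduces the comparison of two image populations to the cosine similarity of their embedding centroids, with the image embedder (e.g., CLIP) assumed.

import numpy as np

def conceptual_coverage(src_image_embs: np.ndarray, tgt_image_embs: np.ndarray) -> float:
    # Inputs: (n_images, dim) image embeddings generated for one tangible
    # noun in the source language and for its translation in the target.
    src_mean = src_image_embs.mean(axis=0)
    tgt_mean = tgt_image_embs.mean(axis=0)
    return float(src_mean @ tgt_mean / (np.linalg.norm(src_mean) * np.linalg.norm(tgt_mean)))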
CausalDialogue: Modeling Utterance-level Causality in Conversations Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang
Findings of ACL 2023
@inproceedings{tuan2023causaldialogue,
author = "Tuan, Yi-Lin and Albalak, Alon and Xu, Wenda and Saxon, Michael and Pryor, Connor and Getoor, Lise and Wang, William Yang",
title = "CausalDialogue: Modeling Utterance-level Causality in Conversations",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
pages = "12506--12522",
year = "2023",
}
@article{rose2023visual,
author = "Rose, Daniel and Himakunthala, Vaishnavi and Ouyang, Andy and He, Ryan and Mei, Alex and Lu, Yujie and Saxon, Michael and Sonar, Chinmay and Mirza, Diba and Wang, William Yang",
title = "Visual chain of thought: bridging logical gaps with multimodal infillings",
journal = "arXiv preprint arXiv:2305.02317",
year = "2023",
url = {https://arxiv.org/abs/2305.02317},
}
@article{mei2023users,
author = "Mei, Alex and Saxon, Michael and Chang, Shiyu and Lipton, Zachary C and Wang, William Yang",
title = "Users are the north star for ai transparency",
journal = "arXiv preprint arXiv:2303.05500",
year = "2023",
url = {https://arxiv.org/abs/2303.05500},
}
Abstract: Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of subpopulations in the data which maximally exhibit it.
@inproceedings{saxon2023peco,
author = "Saxon, Michael and Wang, Xinyi and Xu, Wenda and Wang, William Yang",
title = "PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
pages = "3061--3074",
year = "2023",
abstract = "Building natural language inference (NLI) benchmarks that are both challenging for modern techniques, and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO) which enables both the objective measurement of leakage, and the automated detection of subpopulations in the data which maximally exhibit it.",
url = {https://aclanthology.org/2023.eacl-main.223/},
}
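A sketch of the baseline leakage probe that motivates PECO (PECO itself goes further and localizes the leaky subpopulations; this shows only the basic hypothesis-only check):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(hypotheses, labels):
    # If the label is predictable from the hypothesis alone, the dataset
    # leaks: ~0.33 is chance on balanced 3-class NLI, and anything well
    # above it indicates single sentence label leakage.
    probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(probe, hypotheses, labels, cv=5).mean()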
Abstract: As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
@inproceedings{ho2023wikiwhy,
author = "Ho, Matthew and Sharma, Aditya and Chang, Justin and Saxon, Michael and Levy, Sharon and Lu, Yujie and Wang, William Yang",
title = "Wikiwhy: Answering and explaining cause-and-effect questions",
booktitle = "The Eleventh International Conference on Learning Representations",
year = "2023",
abstract = {As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7\% human-evaluated correctness in the end-to-end answer \& explain condition, leaving significant room for future improvements.},
url = {https://openreview.net/forum?id=vaxnu-Ut},
}
Abstract: While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.
@inproceedings{wang2023causal,
author = "Wang, Xinyi and Saxon, Michael and Li, Jiachen and Zhang, Hongyang and Zhang, Kun and Wang, William Yang",
title = "Causal Balancing for Domain Generalization",
booktitle = "The Eleventh International Conference on Learning Representations",
pages = "https--openreview",
year = "2023",
abstract = "While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.",
url = {https://openreview.net/forum?id=F91SROvVJ_6},
}
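A simplified sketch of balanced mini-batch sampling as the abstract describes it, drawing equally from every (label, environment) cell so the batch distribution carries no spurious label-environment correlation; the uniform-cell strategy and all names are illustrative assumptions, not the paper's exact procedure.

import random
from collections import defaultdict

def balanced_batches(examples, batch_size):
    # examples: iterable of (x, label, environment) triples
    cells = defaultdict(list)
    for x, y, e in examples:
        cells[(y, e)].append((x, y))
    per_cell = max(1, batch_size // len(cells))
    while True:
        batch = [random.choice(cells[c]) for c in cells for _ in range(per_cell)]
        random.shuffle(batch)
        yield batch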
2022
Self-supervised knowledge assimilation for expert-layman text style transfer Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang
AAAI 2022
@inproceedings{xu2022self,
author = "Xu, Wenda and Saxon, Michael and Sra, Misha and Wang, William Yang",
title = "Self-supervised knowledge assimilation for expert-layman text style transfer",
booktitle = "Proceedings of the AAAI Conference on Artificial Intelligence",
volume = "36",
number = "10",
pages = "11566--11574",
year = "2022",
}
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
Findings of EMNLP 2022
@inproceedings{xu2022not,
author = "Xu, Wenda and Tuan, Yi-Lin and Lu, Yujie and Saxon, Michael and Li, Lei and Wang, William Yang",
title = "Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
pages = "6559--6574",
year = "2022",
}
2021
Investigating Memorization of Conspiracy Theories in Text Generation Sharon Levy, Michael Saxon, William Yang Wang
Findings of ACL-IJCNLP 2021
@article{levy2021investigating,
author = "Levy, Sharon and Saxon, Michael and Wang, William Yang",
title = "Investigating Memorization of Conspiracy Theories in Text Generation",
journal = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
pages = "4718--4729",
year = "2021",
publisher = "Association for Computational Linguistics",
}
End-to-End Spoken Language Understanding for Generalized Voice Assistants Michael Saxon, Samridhi Choudhary, Joseph P McKenna, Athanasios Mouchtaris
Interspeech 2021
@inproceedings{saxon2021end,
author = "Saxon, Michael and Choudhary, Samridhi and McKenna, Joseph P and Mouchtaris, Athanasios",
title = "End-to-End Spoken Language Understanding for Generalized Voice Assistants",
booktitle = "Interspeech 2021",
pages = "4738--4742",
year = "2021",
}
@inproceedings{saxon2021modeling,
author = "Saxon, Michael and Levy, Sharon and Wang, Xinyi and Albalak, Alon and Wang, William Yang",
title = "Modeling Disclosive Transparency in NLP Application Descriptions",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
pages = "2023--2037",
year = "2021",
url = {https://aclanthology.org/2021.emnlp-main.153/},
}
Counterfactual maximum likelihood estimation for training deep networks Xinyi Wang, Wenhu Chen, Michael Saxon, William Yang Wang
NeurIPS 2021
@article{wang2021counterfactual,
author = "Wang, Xinyi and Chen, Wenhu and Saxon, Michael and Wang, William Yang",
title = "Counterfactual maximum likelihood estimation for training deep networks",
journal = "Advances in Neural Information Processing Systems",
volume = "34",
pages = "25072--25085",
year = "2021",
}
2020
Semantic Complexity in End-to-End Spoken Language Understanding Michael Saxon*, Joseph P McKenna*, Samridhi Choudhary*, Grant P Strimel, Athanasios Mouchtaris
Interspeech 2020
@inproceedings{mckenna2020semantic,
author = "McKenna*, Joseph P and Choudhary*, Samridhi and Saxon*, Michael and Strimel, Grant P and Mouchtaris, Athanasios",
title = "Semantic Complexity in End-to-End Spoken Language Understanding",
booktitle = "Proc. Interspeech 2020",
pages = "4273--4277",
year = "2020",
}
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan
Interspeech 2020
@article{moore2020uncommonvoice,
author = "Moore, Meredith and Papreja, Piyush and Saxon, Michael and Berisha, Visar and Panchanathan, Sethuraman",
title = "UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech",
journal = "Proc. Interspeech 2020",
pages = "2532--2536",
year = "2020",
}
Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features Michael Saxon, Ayush Tripathi, Yishan Jiao, Julie Liss, Visar Berisha
IEEE/ACM TASLP 2020
@article{saxon2020robust,
author = "Saxon, Michael and Tripathi, Ayush and Jiao, Yishan and Liss, Julie and Berisha, Visar",
title = "Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
volume = "28",
pages = "2511--2522",
year = "2020",
publisher = "IEEE",
}
2019
Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make Meredith Moore, Michael Saxon, Hemanth Venkateswara, Visar Berisha, Sethuraman Panchanathan
Interspeech 2019
@article{moore2019say,
author = "Moore, Meredith and Saxon, Michael and Venkateswara, Hemanth and Berisha, Visar and Panchanathan, Sethuraman",
title = "Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make",
journal = "Proc. Interspeech 2019",
pages = "2528--2532",
year = "2019",
}
Objective measures of plosive nasalization in hypernasal speech Michael Saxon, Julie Liss, Visar Berisha
ICASSP 2019
@inproceedings{saxon2019objective,
author = "Saxon, Michael and Liss, Julie and Berisha, Visar",
title = "Objective measures of plosive nasalization in hypernasal speech",
booktitle = "ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
pages = "6520--6524",
year = "2019",
organization = "IEEE",
}
Abstract: Planar strain mapping of interconnects, such as Ball Grid Arrays (BGA) and Through-Silicon Vias (TSVs), is an important step in the development and testing of electronics packages, as excessive strain can lead to device failure. Today, the two most widely adopted methods of experimental strain field measurement, Digital Image Correlation and Moiré Interferometry, encounter limitations when the average strain magnitude drops below 1μm on thermally loaded samples. DIC provides only limited fields of view and increases measurement complexity, while Moiré Interferometry suffers from poor resolution near material interfaces. If well bonded to a surface, changes in periodicity of a thin diffraction grating can be strongly dependent on the strain field. The local grating periodicity can then be measured using laser diffraction. We present a means of mapping the pitch of a grating bonded to the surface of a through-silicon via interconnect in two dimensions over a wide field of view, with a high degree of repeatability at room and elevated temperature.
@inproceedings{houghton20162d,
author = "Houghton, Todd and Saxon, Michael and Song, Zeming and Nyugen, Hoa and Jiang, Hanqing and Yu, Hongbin",
title = "2D Grating Pitch Mapping of a through Silicon Via (TSV) and Solder Ball Interconnect Region Using Laser Diffraction: IEEE Electronic Components and Technology Conference, 2016",
booktitle = "2016 IEEE 66th Electronic Components and Technology Conference (ECTC)",
pages = "2222--2227",
year = "2016",
organization = "IEEE",
abstract = "Planar Strain mapping of interconnects, such as Ball Grid Arrays (BGA) and Through-Silicon Vias (TSVs), is an important step in the development and testing of electronics packages, as excessive strain can lead to device failure. Today, the two most widely adopted methods of experimental strain field measurement, Digital Image Correlation and Moiré Interferometry, encounter limitations when the average strain magnitude drops below 1μm on thermally loaded samples. DIC provides only limited fields of view and increases measurement complexity, while Moiré Interferometry suffers from resolution near material interfaces. If well bonded to a surface, changes in periodicity of a thin diffraction grating can be strongly dependent on the strain field. The local grating periodicity, can then be measured using laser diffraction. We present a means of mapping the pitch of a grating bonded to the surface of a through-silicon via interconnect in two dimensions over a wide field of view, with a high degree of repeatability at room and elevated temperature.",
url = {https://ieeexplore.ieee.org/document/7545732},
}
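The measurement above relies on the grating equation. As a worked sketch (illustrative numbers, not the paper's data): the local pitch follows from the measured first-order diffraction angle, and strain follows from the pitch change relative to the unloaded pitch.

import math

def local_pitch(wavelength_nm: float, theta_rad: float, order: int = 1) -> float:
    # Grating equation d*sin(theta) = m*lambda, solved for the pitch d.
    return order * wavelength_nm / math.sin(theta_rad)

def strain(pitch_nm: float, rest_pitch_nm: float) -> float:
    return (pitch_nm - rest_pitch_nm) / rest_pitch_nm

d = local_pitch(632.8, math.radians(22.5))  # HeNe laser; angle illustrative
print(f"pitch = {d:.1f} nm, strain = {strain(d, 1650.0):.4%}")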
Non-archival Presentations and Workshop Papers
2023
Disparities in Text-to-Image Model Concept Possession Across Languages Michael Saxon, William Yang Wang
FAccT 2023
@inproceedings{saxon2023disparities,
author = "Saxon, Michael and Wang, William Yang",
title = "Disparities in Text-to-Image Model Concept Possession Across Languages",
booktitle = "Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency",
pages = "1870--1870",
year = "2023",
}
Data Augmentation for Diverse Voice Conversion in Noisy Environments Avani Tanna, Michael Saxon, Amr El Abbadi, William Yang Wang
INTERSPEECH 2023
@inproceedings{tanna2023data,
author = "Tanna, Avani and Saxon, Michael and Abbadi, Amr El and Wang, William Yang",
title = "Data Augmentation for Diverse Voice Conversion in Noisy Environments",
booktitle = "Interspeech 2023 Demos",
pages = "2024--2025",
year = "2023",
}
@inproceedings{saxon2019word,
author = "Saxon, Michael and Bhandari, Samarth and Ruskin, Lewis and Honda, Gabrielle",
title = "Word pair convolutional model for happy moment classification",
booktitle = "Proceedings of the 2nd Workshop on Affective Content Analysis@ AAAI (AffCon2019), Honolulu, Hawaii (January 2019)",
pages = "111--119",
year = "2019",
url = {https://ceur-ws.org/Vol-2328/},
}
Abstract: Perception of social cues is a fundamental communicative skill that can be hampered by hearing and cognitive disorders. Understanding slang and sarcastic intent is often difficult in verbal communication, particularly for individuals who struggle with the perception of social cues. Misinterpretation of slang terms can cause discomfort or social isolation. Sarcasm is particularly difficult to recognize due to its inherently ambiguous and context-dependent nature. We have identified two problems of particular interest in social assistive technologies – slang word sentiment assessment and sarcasm detection. We propose combining a slang sentiment analysis model with a speech emotion analysis model to create an assistive tool, Chat-Box, which will detect social cues such as sarcasm, slang, and sentiment.
@inproceedings{gupta2018chat,
author = "Gupta, Bineeta and Saxon, Michael and McDaniel, Troy and Panchanathan, Sethuraman",
title = "Chat-Box: Proposing a Mood Analyzer for Individuals with Social Interaction Disabilities",
booktitle = "HCI International 2018--Posters' Extended Abstracts: 20th International Conference, HCI International 2018, Las Vegas, NV, USA, July 15-20, 2018, Proceedings, Part II 20",
pages = "394--401",
year = "2018",
organization = "Springer International Publishing",
abstract = "Perception of social cues is a fundamental communicative skill that can be hampered by hearing and cognitive disorders. Understanding slang and sarcastic intent is often difficult in verbal communication, particularly for individuals who struggle with the perception of social cues. Misinterpretation of slang terms can cause discomfort or social isolation. Sarcasm is particularly difficult to recognize due to its inherently ambiguous and context-dependent nature. We have identified two problems of particular interest in social assistive technologies – slang word sentiment assessment and sarcasm detection. We propose combining a slang sentiment analysis model with a speech emotion analysis model to create an assistive tool, Chat-Box, which will detect social cues such as sarcasm, slang, and sentiment.",
url = {https://link.springer.com/chapter/10.1007/978-3-319-92279-9_53},
}