Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.
@inproceedings{saxon2024evaluates,
author = "Saxon, Michael and Jahara, Fatima and Khoshnoodi, Mahsa and Lu, Yujie and Sharma, Aditya and Wang, William Yang",
title = "Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)",
booktitle = "The Thirty-eighth Annual Conference on Neural Information Processing Systems",
year = "2024",
abstract = "With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.",
url = {https://openreview.net/forum?id=S4YRCLbUK1},
}
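As a minimal illustration of the ordering meta-metric described in the abstract above (a sketch under assumed inputs, not the TS2 implementation): a faithful metric should rank images in the same order as their objective error counts, which a Spearman rank correlation captures. The names error_counts and metric_scores are illustrative.

from scipy.stats import spearmanr

def ordering_score(error_counts, metric_scores):
    # A faithful metric should assign LOWER scores to images with MORE
    # errors, so we expect a strong negative rank correlation.
    rho, _ = spearmanr(error_counts, metric_scores)
    return -rho  # higher is better: 1.0 is a perfect expected ordering

# Example: five images along one walk of a semantic error graph.
print(ordering_score([0, 1, 2, 3, 4], [0.95, 0.80, 0.71, 0.55, 0.40]))  # 1.0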
Abstract: We present LoCoVQA, a dynamic benchmark generator for evaluating long-context reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries—a task that is quite easy for language models (LMs) in the text domain—demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
@inproceedings{sharma2024losing,
author = "Sharma*, Aditya and Saxon*, Michael and Wang, William Yang",
title = "Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
pages = "5429--5451",
year = "2024",
abstract = "We present LoCoVQA, a dynamic benchmark generator for evaluating long-context reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images.Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking logarithmic decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries—a task that is quite easy for language models (LMs) in the text domain—demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.",
url = {https://aclanthology.org/2024.findings-emnlp.312/},
}
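To make the "logarithmic decay trend" concrete, here is a sketch of fitting accuracy ≈ a + b·ln(n) to per-context-length scores; the accuracy numbers below are made up for illustration, not results from the paper.

import numpy as np

context_lengths = np.array([1, 2, 4, 8, 16, 32])  # composited images per input
accuracies = np.array([0.90, 0.78, 0.69, 0.57, 0.48, 0.36])  # illustrative only

# For deg=1, np.polyfit on ln(n) returns [slope, intercept].
b, a = np.polyfit(np.log(context_lengths), accuracies, deg=1)
print(f"accuracy ~= {a:.2f} + {b:.2f} * ln(n)")  # b < 0 indicates decay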
Abstract: Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and add clarity to the AI discussion.
@inproceedings{saxon2024benchmarks,
author = "Saxon, Michael and Holtzman, Ari and West, Peter and Wang, William Yang and Saphra, Naomi",
title = "Benchmarks as Microscopes: A Call for Model Metrology",
booktitle = "First Conference on Language Modeling (COLM 2024)",
year = "2024",
abstract = "Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs requires a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs to and add clarity to the AI discussion.",
url = {https://openreview.net/forum?id=bttKwCZDkm},
}
Abstract: Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.
@article{feng2024tc,
author = "Feng, Weixi and Li, Jiachen and Saxon, Michael and Fu, Tsu-jui and Chen, Wenhu and Wang, William Yang",
title = "Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation",
journal = "arXiv preprint arXiv:2406.08656",
year = "2024",
abstract = "Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20\% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.",
url = {https://arxiv.org/abs/2406.08656},
}
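The abstract mentions new transition-completeness metrics without detailing them here, so the following is only a rough sketch of the underlying idea (both endpoint states should be realized), not the paper's actual metric. The similarity argument stands in for any image-text scorer such as CLIP; all names are illustrative assumptions.

def transition_completion(frames, initial_desc, final_desc, similarity, k=4):
    # Early frames should match the prompt's initial state and late frames
    # its final state; averaging rewards only videos that realize BOTH.
    start = sum(similarity(f, initial_desc) for f in frames[:k]) / k
    end = sum(similarity(f, final_desc) for f in frames[-k:]) / k
    return 0.5 * (start + end)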
Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.
@inproceedings{saxon2024lost,
author = "Saxon, Michael and Luo, Yiran and Levy, Sharon and Baral, Chitta and Yang, Yezhou and Wang, William Yang",
title = "Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts",
booktitle = "NAACL 2024",
pages = "572--582",
year = "2024",
abstract = {Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and comparing the output image populations. Unfortunately, we find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese. We provide corrections for these errors and analyze how impactful they are on the utility and validity of CoCo-CroLa as a benchmark. We reassess multiple baseline T2I models with the revisions, compare the outputs elicited under the new translations to those conditioned on the old, and show that a correction's impactfulness on the image-domain benchmark results can be predicted in the text domain with similarity scores. Our findings will guide the future development of T2I multilinguality metrics by providing analytical tools for practical translation decisions.},
url = {https://aclanthology.org/2024.naacl-short.48/},
}
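As a sketch of the text-domain predictor mentioned in the abstract (the embedding model choice is an assumption, not the paper's setup): the further a corrected translation drifts from the erroneous one in a multilingual embedding space, the larger the expected change in the generated image population.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

def correction_drift(old_translation: str, new_translation: str) -> float:
    # 0.0 means the correction changes nothing; larger values predict a
    # bigger shift in the image-domain benchmark results.
    old_vec, new_vec = model.encode([old_translation, new_translation])
    return 1.0 - util.cos_sim(old_vec, new_vec).item()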
Abstract: While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
@article{pan2024automatically,
author = "Pan, Liangming and Saxon, Michael and Xu, Wenda and Nathani, Deepak and Wang, Xinyi and Wang, William Yang",
title = "Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies",
journal = "Transactions of the Association for Computational Linguistics",
volume = "12",
pages = "484--506",
year = "2024",
abstract = "While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.",
url = {https://aclanthology.org/2024.tacl-1.27/},
}
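A minimal post-hoc correction loop of the kind the survey categorizes, sketched with hypothetical llm and critic callables (the critic could be the LLM itself for self-correction, or any external feedback system):

def correct_with_feedback(llm, critic, prompt, max_rounds=3):
    draft = llm(prompt)
    for _ in range(max_rounds):
        feedback = critic(draft)  # automated feedback, no human in the loop
        if feedback == "OK":
            break
        draft = llm(f"{prompt}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\nRevise:")
    return draft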
@article{wu2024vsp,
author = "Wu, Qiucheng and Zhao, Handong and Saxon, Michael and Bui, Trung and Wang, William Yang and Zhang, Yang and Chang, Shiyu",
title = "Vsp: Assessing the dual challenges of perception and reasoning in spatial planning tasks for vlms",
journal = "arXiv preprint arXiv:2407.01863",
year = "2024",
url = {https://arxiv.org/abs/2407.01863},
}
@inproceedings{himakunthala2023let,
author = "Himakunthala, Vaishnavi and Ouyang, Andy and Rose, Daniel and He, Ryan and Mei, Alex and Lu, Yujie and Sonar, Chinmay and Saxon, Michael and Wang, William Yang",
title = "Let's think frame by frame with vip: A video infilling and prediction dataset for evaluating video chain-of-thought",
booktitle = "Conference on Empirical Methods in Natural Language Processing",
year = "2023",
url = {https://arxiv.org/abs/2305.13903},
}
Abstract: In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.
@article{wang2023large,
author = "Wang, Xinyi and Zhu, Wanrong and Saxon, Michael and Steyvers, Mark and Wang, William Yang",
title = "Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning",
journal = "Advances in Neural Information Processing Systems",
volume = "36",
pages = "15614--15638",
year = "2023",
abstract = "In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.",
url = {https://proceedings.neurips.cc/paper_files/paper/2023/hash/3255a7554605a88800f4e120b3a929e1-Abstract-Conference.html},
}
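A simplified sketch of the selection step described in the abstract, assuming log_likelihood is a hypothetical helper returning the small LM's summed log-probability of a target given a prompt (this is not the paper's exact algorithm): score each candidate demonstration by how well it helps the small LM predict held-out labels, keep the top k, and reuse those demonstrations verbatim with larger models.

def select_demonstrations(candidates, dev_set, log_likelihood, k=4):
    def score(demo):
        # Sum over held-out (input, label) pairs with the demo as prefix.
        return sum(
            log_likelihood(prompt=f"{demo}\n{x}", target=y) for x, y in dev_set
        )
    return sorted(candidates, key=score, reverse=True)[:k]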
Abstract: We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language. This technique allows us to estimate how well-suited a model is to a target language as well as identify model-specific weaknesses, spurious correlations, and biases without a-priori assumptions. We demonstrate how it can be used to benchmark T2I models in terms of multilinguality, and how despite its simplicity it is a good proxy for impressive generalization.
@inproceedings{saxon2023multilingual,
author = "Saxon, Michael and Wang, William Yang",
title = "Multilingual Conceptual Coverage in Text-to-Image Models",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
pages = "4831--4848",
year = "2023",
abstract = {We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language. This technique allows us to estimate how well-suited a model is to a target language as well as identify model-specific weaknesses, spurious correlations, and biases without a-priori assumptions. We demonstrate how it can be used to benchmark T2I models in terms of multilinguality, and how despite its simplicity it is a good proxy for impressive generalization.},
url = {https://aclanthology.org/2023.acl-long.266/},
}
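The population comparison in the abstract can be sketched as follows; this is an illustrative simplification that reduces the comparison of two image populations to the cosine similarity of their embedding centroids, with the image embedder (e.g., CLIP) assumed.

import numpy as np

def conceptual_coverage(src_image_embs: np.ndarray, tgt_image_embs: np.ndarray) -> float:
    # Inputs: (n_images, dim) image embeddings generated for one tangible
    # noun in the source language and for its translation in the target.
    src_mean = src_image_embs.mean(axis=0)
    tgt_mean = tgt_image_embs.mean(axis=0)
    return float(src_mean @ tgt_mean / (np.linalg.norm(src_mean) * np.linalg.norm(tgt_mean)))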
CausalDialogue: Modeling Utterance-level Causality in Conversations Yi-Lin Tuan, Alon Albalak, Wenda Xu, Michael Saxon, Connor Pryor, Lise Getoor, William Yang Wang
Findings of ACL 2023
@inproceedings{tuan2023causaldialogue,
author = "Tuan, Yi-Lin and Albalak, Alon and Xu, Wenda and Saxon, Michael and Pryor, Connor and Getoor, Lise and Wang, William Yang",
title = "CausalDialogue: Modeling Utterance-level Causality in Conversations",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
pages = "12506--12522",
year = "2023",
}
@article{rose2023visual,
author = "Rose, Daniel and Himakunthala, Vaishnavi and Ouyang, Andy and He, Ryan and Mei, Alex and Lu, Yujie and Saxon, Michael and Sonar, Chinmay and Mirza, Diba and Wang, William Yang",
title = "Visual chain of thought: bridging logical gaps with multimodal infillings",
journal = "arXiv preprint arXiv:2305.02317",
year = "2023",
url = {https://arxiv.org/abs/2305.02317},
}
@article{mei2023users,
author = "Mei, Alex and Saxon, Michael and Chang, Shiyu and Lipton, Zachary C and Wang, William Yang",
title = "Users are the north star for ai transparency",
journal = "arXiv preprint arXiv:2303.05500",
year = "2023",
url = {https://arxiv.org/abs/2303.05500},
}
Abstract: Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of subpopulations in the data which maximally exhibit it.
@inproceedings{saxon2023peco,
author = "Saxon, Michael and Wang, Xinyi and Xu, Wenda and Wang, William Yang",
title = "PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
pages = "3061--3074",
year = "2023",
abstract = "Building natural language inference (NLI) benchmarks that are both challenging for modern techniques, and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO) which enables both the objective measurement of leakage, and the automated detection of subpopulations in the data which maximally exhibit it.",
url = {https://aclanthology.org/2023.eacl-main.223/},
}
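A sketch of the baseline leakage probe that motivates PECO (PECO itself goes further and localizes the leaky subpopulations; this shows only the basic hypothesis-only check):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(hypotheses, labels):
    # If the label is predictable from the hypothesis alone, the dataset
    # leaks: ~0.33 is chance on balanced 3-class NLI, and anything well
    # above it indicates single sentence label leakage.
    probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(probe, hypotheses, labels, cv=5).mean()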
Abstract: As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.
@inproceedings{ho2023wikiwhy,
author = "Ho, Matthew and Sharma, Aditya and Chang, Justin and Saxon, Michael and Levy, Sharon and Lu, Yujie and Wang, William Yang",
title = "Wikiwhy: Answering and explaining cause-and-effect questions",
booktitle = "The Eleventh International Conference on Learning Representations",
year = "2023",
abstract = {As large language models (LLMs) grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7\% human-evaluated correctness in the end-to-end answer \& explain condition, leaving significant room for future improvements.},
url = {https://openreview.net/forum?id=vaxnu-Ut},
}
Abstract: While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.
@inproceedings{wang2023causal,
author = "Wang, Xinyi and Saxon, Michael and Li, Jiachen and Zhang, Hongyang and Zhang, Kun and Wang, William Yang",
title = "Causal Balancing for Domain Generalization",
booktitle = "The Eleventh International Conference on Learning Representations",
pages = "https--openreview",
year = "2023",
abstract = "While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to transform a biased data distribution into a spurious-free balanced distribution, based on the invariance of the underlying causal mechanisms for the data generation process. We argue that the Bayes optimal classifiers trained on such balanced distribution are minimax optimal across a diverse enough environment space. We also provide an identifiability guarantee of the latent variable model of the proposed data generation process, when utilizing enough train environments. Experiments are conducted on DomainBed, demonstrating empirically that our method obtains the best performance across 20 baselines reported on the benchmark.",
url = {https://openreview.net/forum?id=F91SROvVJ_6},
}
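A simplified sketch of balanced mini-batch sampling as the abstract describes it, drawing equally from every (label, environment) cell so the batch distribution carries no spurious label-environment correlation; the uniform-cell strategy and all names are illustrative assumptions, not the paper's exact procedure.

import random
from collections import defaultdict

def balanced_batches(examples, batch_size):
    # examples: iterable of (x, label, environment) triples
    cells = defaultdict(list)
    for x, y, e in examples:
        cells[(y, e)].append((x, y))
    per_cell = max(1, batch_size // len(cells))
    while True:
        batch = [random.choice(cells[c]) for c in cells for _ in range(per_cell)]
        random.shuffle(batch)
        yield batch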
2022
Self-supervised knowledge assimilation for expert-layman text style transfer Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang
AAAI 2022
@inproceedings{xu2022self,
author = "Xu, Wenda and Saxon, Michael and Sra, Misha and Wang, William Yang",
title = "Self-supervised knowledge assimilation for expert-layman text style transfer",
booktitle = "Proceedings of the AAAI Conference on Artificial Intelligence",
volume = "36",
number = "10",
pages = "11566--11574",
year = "2022",
}
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
Findings of EMNLP 2022
@inproceedings{xu2022not,
author = "Xu, Wenda and Tuan, Yi-Lin and Lu, Yujie and Saxon, Michael and Li, Lei and Wang, William Yang",
title = "Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
pages = "6559--6574",
year = "2022",
}
2021
Investigating Memorization of Conspiracy Theories in Text Generation Sharon Levy, Michael Saxon, William Yang Wang
Findings of ACL-IJCNLP 2021
@article{levy2021investigating,
author = "Levy, Sharon and Saxon, Michael and Wang, William Yang",
title = "Investigating Memorization of Conspiracy Theories in Text Generation",
journal = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
pages = "4718--4729",
year = "2021",
publisher = "Association for Computational Linguistics",
}
End-to-End Spoken Language Understanding for Generalized Voice Assistants Michael Saxon, Samridhi Choudhary, Joseph P McKenna, Athanasios Mouchtaris
Interspeech 2021
@inproceedings{saxon2021end,
author = "Saxon, Michael and Choudhary, Samridhi and McKenna, Joseph P and Mouchtaris, Athanasios",
title = "End-to-End Spoken Language Understanding for Generalized Voice Assistants",
booktitle = "Interspeech 2021",
pages = "4738--4742",
year = "2021",
}
@inproceedings{saxon2021modeling,
author = "Saxon, Michael and Levy, Sharon and Wang, Xinyi and Albalak, Alon and Wang, William Yang",
title = "Modeling Disclosive Transparency in NLP Application Descriptions",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
pages = "2023--2037",
year = "2021",
url = {https://aclanthology.org/2021.emnlp-main.153/},
}
Counterfactual maximum likelihood estimation for training deep networks Xinyi Wang, Wenhu Chen, Michael Saxon, William Yang Wang
NeurIPS 2021
@article{wang2021counterfactual,
author = "Wang, Xinyi and Chen, Wenhu and Saxon, Michael and Wang, William Yang",
title = "Counterfactual maximum likelihood estimation for training deep networks",
journal = "Advances in Neural Information Processing Systems",
volume = "34",
pages = "25072--25085",
year = "2021",
}
2020
Semantic Complexity in End-to-End Spoken Language Understanding Michael Saxon*, Joseph P McKenna*, Samridhi Choudhary*, Grant P Strimel, Athanasios Mouchtaris
Interspeech 2020
@inproceedings{mckenna2020semantic,
author = "McKenna*, Joseph P and Choudhary*, Samridhi and Saxon*, Michael and Strimel, Grant P and Mouchtaris, Athanasios",
title = "Semantic Complexity in End-to-End Spoken Language Understanding",
booktitle = "Proc. Interspeech 2020",
pages = "4273--4277",
year = "2020",
}
UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech Meredith Moore, Piyush Papreja, Michael Saxon, Visar Berisha, Sethuraman Panchanathan
Interspeech 2020
@article{moore2020uncommonvoice,
author = "Moore, Meredith and Papreja, Piyush and Saxon, Michael and Berisha, Visar and Panchanathan, Sethuraman",
title = "UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech",
journal = "Proc. Interspeech 2020",
pages = "2532--2536",
year = "2020",
}
Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features Michael Saxon, Ayush Tripathi, Yishan Jiao, Julie Liss, Visar Berisha
IEEE/ACM TASLP 2020
@article{saxon2020robust,
author = "Saxon, Michael and Tripathi, Ayush and Jiao, Yishan and Liss, Julie and Berisha, Visar",
title = "Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
volume = "28",
pages = "2511--2522",
year = "2020",
publisher = "IEEE",
}
2019
Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make Meredith Moore, Michael Saxon, Hemanth Venkateswara, Visar Berisha, Sethuraman Panchanathan
Interspeech 2019
@article{moore2019say,
author = "Moore, Meredith and Saxon, Michael and Venkateswara, Hemanth and Berisha, Visar and Panchanathan, Sethuraman",
title = "Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make",
journal = "Proc. Interspeech 2019",
pages = "2528--2532",
year = "2019",
}
Objective measures of plosive nasalization in hypernasal speech Michael Saxon, Julie Liss, Visar Berisha
ICASSP 2019
@inproceedings{saxon2019objective,
author = "Saxon, Michael and Liss, Julie and Berisha, Visar",
title = "Objective measures of plosive nasalization in hypernasal speech",
booktitle = "ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
pages = "6520--6524",
year = "2019",
organization = "IEEE",
}
Abstract: Planar strain mapping of interconnects, such as Ball Grid Arrays (BGA) and Through-Silicon Vias (TSVs), is an important step in the development and testing of electronics packages, as excessive strain can lead to device failure. Today, the two most widely adopted methods of experimental strain field measurement, Digital Image Correlation and Moiré Interferometry, encounter limitations when the average strain magnitude drops below 1μm on thermally loaded samples. DIC provides only limited fields of view and increases measurement complexity, while Moiré Interferometry suffers from poor resolution near material interfaces. If well bonded to a surface, changes in periodicity of a thin diffraction grating can be strongly dependent on the strain field. The local grating periodicity can then be measured using laser diffraction. We present a means of mapping the pitch of a grating bonded to the surface of a through-silicon via interconnect in two dimensions over a wide field of view, with a high degree of repeatability at room and elevated temperature.
@inproceedings{houghton20162d,
author = "Houghton, Todd and Saxon, Michael and Song, Zeming and Nyugen, Hoa and Jiang, Hanqing and Yu, Hongbin",
title = "2D Grating Pitch Mapping of a through Silicon Via (TSV) and Solder Ball Interconnect Region Using Laser Diffraction: IEEE Electronic Components and Technology Conference, 2016",
booktitle = "2016 IEEE 66th Electronic Components and Technology Conference (ECTC)",
pages = "2222--2227",
year = "2016",
organization = "IEEE",
abstract = "Planar Strain mapping of interconnects, such as Ball Grid Arrays (BGA) and Through-Silicon Vias (TSVs), is an important step in the development and testing of electronics packages, as excessive strain can lead to device failure. Today, the two most widely adopted methods of experimental strain field measurement, Digital Image Correlation and Moiré Interferometry, encounter limitations when the average strain magnitude drops below 1μm on thermally loaded samples. DIC provides only limited fields of view and increases measurement complexity, while Moiré Interferometry suffers from resolution near material interfaces. If well bonded to a surface, changes in periodicity of a thin diffraction grating can be strongly dependent on the strain field. The local grating periodicity, can then be measured using laser diffraction. We present a means of mapping the pitch of a grating bonded to the surface of a through-silicon via interconnect in two dimensions over a wide field of view, with a high degree of repeatability at room and elevated temperature.",
url = {https://ieeexplore.ieee.org/document/7545732},
}
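The measurement above relies on the grating equation. As a worked sketch (illustrative numbers, not the paper's data): the local pitch follows from the measured first-order diffraction angle, and strain follows from the pitch change relative to the unloaded pitch.

import math

def local_pitch(wavelength_nm: float, theta_rad: float, order: int = 1) -> float:
    # Grating equation d*sin(theta) = m*lambda, solved for the pitch d.
    return order * wavelength_nm / math.sin(theta_rad)

def strain(pitch_nm: float, rest_pitch_nm: float) -> float:
    return (pitch_nm - rest_pitch_nm) / rest_pitch_nm

d = local_pitch(632.8, math.radians(22.5))  # HeNe laser; angle illustrative
print(f"pitch = {d:.1f} nm, strain = {strain(d, 1650.0):.4%}")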
Non-archival Presentations and Workshop Papers
2023
Disparities in Text-to-Image Model Concept Possession Across Languages Michael Saxon, William Yang Wang
FAccT 2023
@inproceedings{saxon2023disparities,
author = "Saxon, Michael and Wang, William Yang",
title = "Disparities in Text-to-Image Model Concept Possession Across Languages",
booktitle = "Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency",
pages = "1870--1870",
year = "2023",
}
Data Augmentation for Diverse Voice Conversion in Noisy Environments Avani Tanna, Michael Saxon, Amr El Abbadi, William Yang Wang
INTERSPEECH 2023
@inproceedings{tanna2023data,
author = "Tanna, Avani and Saxon, Michael and Abbadi, Amr El and Wang, William Yang",
title = "Data Augmentation for Diverse Voice Conversion in Noisy Environments",
booktitle = "Interspeech 2023 Demos",
pages = "2024--2025",
year = "2023",
}
@inproceedings{saxon2019word,
author = "Saxon, Michael and Bhandari, Samarth and Ruskin, Lewis and Honda, Gabrielle",
title = "Word pair convolutional model for happy moment classification",
booktitle = "Proceedings of the 2nd Workshop on Affective Content Analysis@ AAAI (AffCon2019), Honolulu, Hawaii (January 2019)",
pages = "111--119",
year = "2019",
url = {https://ceur-ws.org/Vol-2328/},
}
Abstract: Perception of social cues is a fundamental communicative skill that can be hampered by hearing and cognitive disorders. Understanding slang and sarcastic intent is often difficult in verbal communication, particularly for individuals who struggle with the perception of social cues. Misinterpretation of slang terms can cause discomfort or social isolation. Sarcasm is particularly difficult to recognize due to its inherently ambiguous and context-dependent nature. We have identified two problems of particular interest in social assistive technologies – slang word sentiment assessment and sarcasm detection. We propose combining a slang sentiment analysis model with a speech emotion analysis model to create an assistive tool, Chat-Box, which will detect social cues such as sarcasm, slang, and sentiment.
@inproceedings{gupta2018chat,
author = "Gupta, Bineeta and Saxon, Michael and McDaniel, Troy and Panchanathan, Sethuraman",
title = "Chat-Box: Proposing a Mood Analyzer for Individuals with Social Interaction Disabilities",
booktitle = "HCI International 2018--Posters' Extended Abstracts: 20th International Conference, HCI International 2018, Las Vegas, NV, USA, July 15-20, 2018, Proceedings, Part II 20",
pages = "394--401",
year = "2018",
organization = "Springer International Publishing",
abstract = "Perception of social cues is a fundamental communicative skill that can be hampered by hearing and cognitive disorders. Understanding slang and sarcastic intent is often difficult in verbal communication, particularly for individuals who struggle with the perception of social cues. Misinterpretation of slang terms can cause discomfort or social isolation. Sarcasm is particularly difficult to recognize due to its inherently ambiguous and context-dependent nature. We have identified two problems of particular interest in social assistive technologies – slang word sentiment assessment and sarcasm detection. We propose combining a slang sentiment analysis model with a speech emotion analysis model to create an assistive tool, Chat-Box, which will detect social cues such as sarcasm, slang, and sentiment.",
url = {https://link.springer.com/chapter/10.1007/978-3-319-92279-9_53},
}