Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Yasaman Razeghi - Robert L. Logan IV - Matt Gardner - Sameer Singh

Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.

FRUIT: Faithfully Reflecting Updated Information in Text

NAACL 2022 (Best New Task)

Robert L. Logan IV - Alexandre Passos - Sameer Singh - Ming-Wei Chang

Textual knowledge bases such as Wikipedia require considerable effort to keep up to date and consistent. While automated writing assistants could potentially ease this burden, the problem of suggesting edits grounded in external knowledge has been under-explored. In this paper, we introduce the novel generation task of *faithfully reflecting updated information in text* (FRUIT), where the goal is to update an existing article given new evidence. We release the FRUIT-WIKI dataset, a collection of over 170K distantly supervised instances produced from pairs of Wikipedia snapshots, along with our data generation pipeline and a gold evaluation set of 914 instances whose edits are guaranteed to be supported by the evidence. We provide benchmark results for popular generation systems as well as for EDIT5, a T5-based approach tailored to editing that we introduce and that establishes the state of the art. Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models, and opens doors to many new applications.

Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models

ACL Findings 2022, ENLSP Workshop @ NeurIPS 2021 (Best Poster)

Robert L. Logan IV - Ivana Balazevic - Eric Wallace - Fabio Petroni - Sameer Singh - Sebastian Riedel

Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve accuracy competitive with manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.

Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference

ACL 2021

Robert L. Logan IV - Andrew McCallum - Sameer Singh - Dan Bikel

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models. We investigate: how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.

Active Bayesian Assessment for Black-Box Classifiers

AAAI 2021

Disi Ji - Robert L. Logan IV - Padhraic Smyth - Mark Steyvers

Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an active Bayesian approach for assessment of classifier performance to satisfy the desiderata of both reliability and label-efficiency. We begin by developing inference strategies to quantify uncertainty for common assessment metrics such as accuracy, misclassification cost, and calibration error. We then propose a general framework for active Bayesian assessment using inferred uncertainty to guide efficient selection of instances for labeling, enabling better performance assessment with fewer labels. We demonstrate significant gains from our proposed active Bayesian approach via a series of systematic empirical experiments assessing the performance of modern neural classifiers (e.g., ResNet and BERT) on several standard image and text classification datasets.

Deriving Behavioral Tests from Common Sense Knowledge Graphs

CSKGs Workshop @ AAAI 2021

Yasaman Razeghi - Robert L. Logan IV - Sameer Singh

Although NLP models have demonstrated “superhuman” performance on common sense reasoning tasks, it is unclear whether these models truly have common sense knowledge. Constructing evaluation datasets to test this knowledge is expensive due to the manual effort involved, and is also limited in scope. Meanwhile, common sense knowledge graphs (CSKGs) aim for a wide coverage of structured common sense knowledge, but cannot be directly used for testing purposes. In this work, we introduce a semi-automated approach that leverages CSKGs to construct out-of-domain evaluation sets for NLP tasks that are more scalable than purely manual approaches. Using this procedure, we create test cases from two popular CSKGs, ConceptNet and ATOMIC, to test the common sense reasoning capability of models trained for natural language inference (NLI) and question answering (QA). These tests reveal interesting differences in failure modes of these models: models trained on NLI tend to perform better on tests of ontological knowledge, e.g., ‘is a’ and ‘used for’ relations, while failing on tests that require understanding ‘desires’, ‘needs’, and ‘wants’, whereas QA models perform better on tests that involve ‘wants’ and ‘desires’.

AutoPrompt: Eliciting Knowledge from Language Models Using Automatically Generated Prompts

EMNLP 2020

Taylor Shin* - Yasaman Razeghi* - Robert L. Logan IV* - Eric Wallace - Sameer Singh

Determining the knowledge captured by pretrained language models is an important challenge, and is commonly tackled by probing model representations using classifiers. However, it is difficult to design probes for semantic knowledge such as facts or textual entailment. Reformulating these semantic tasks as cloze tests (i.e., fill-in-the-blank problems) is a promising method for probing such knowledge, but requires manual crafting of textual prompts to elicit this knowledge, limiting its use. In this paper, we develop an automated, task-agnostic method to create cloze prompts for any classification task, based on a gradient-guided search. We find prompts that demonstrate MLMs have an inherent capability to perform sentiment analysis and natural language inference, and without any finetuning, sometimes achieve performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs compared to manual prompts, and further, that MLMs can be used as relation extractors out of the box, when prompted with suitable prompts, more effectively than recent supervised RE models.

Easy, Reproducible and Quality-Controlled Data Collection with CROWDAQ

EMNLP 2020 - Demo

Qiang Ning - Hao Wu - Pradeep Dasigi - Dheeru Dua - Matt Gardner - Robert L. Logan IV - Ana Marasovic - Zhenjin Nie

High-quality and large-scale data are key to success for AI systems. However, large-scale data annotation efforts are often confronted with a set of common challenges: (1) designing a user-friendly annotation interface; (2) training enough annotators efficiently; and (3) reproducibility. To address these problems, we introduce CROWDAQ, an open-source platform that standardizes the data collection pipeline with customizable user-interface components, automated annotator qualification, and saved pipelines in a re-usable format. We show that CROWDAQ simplifies data annotation significantly on a diverse set of data collection use cases and we hope it will be a convenient tool for the community.

COVIDLies: Detecting COVID-19 Misinformation on Social Media

NLP-COVID19 Workshop @ EMNLP 2020 (Best Paper)

Tamanna Hossain* - Robert L. Logan IV* - Arjuna Ugarte* - Yoshitomo Matsubara* - Sean Young - Sameer Singh

The ongoing pandemic has heightened the need for developing tools to flag COVID-19-related misinformation on the internet, specifically on social media such as Twitter. However, due to novel language and the rapid change of information, existing misinformation detection datasets are not effective in evaluating systems designed to detect misinformation on this topic. Misinformation detection can be subdivided into two sub-tasks: retrieval of misconceptions relevant to posts being checked for veracity, and stance detection to identify whether the posts agree, disagree, or express no stance towards the retrieved misconceptions. To facilitate research on this task, we release COVIDLies, a dataset of 5K expert-annotated tweets to evaluate the performance of misinformation detection systems on 86 different pieces of COVID-19-related misinformation. We evaluate existing NLP systems on this dataset, providing first benchmarks and identifying key challenges for future models to improve upon.

On Importance Sampling-Based Evaluation of Latent Language Models

ACL 2020

Robert L. Logan IV - Matt Gardner - Sameer Singh

Language models that use additional latent structures (e.g., syntax trees, coreference chains, knowledge graph links) provide several advantages over traditional language models. However, likelihood-based evaluation of these models is often intractable as it requires marginalizing over the latent space. Existing works avoid this issue by using importance sampling. Although this approach has asymptotic guarantees, analysis is rarely conducted on the effect of decisions such as sample size and choice of proposal distribution on the reported estimates. In this paper, we carry out this analysis for three models: RNNG, EntityNLM, and KGLM. In addition, we elucidate subtle differences in how importance sampling is applied in these works that can have substantial effects on the final estimates, as well as provide theoretical results which reinforce the validity of this technique.
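
As a rough illustration of the estimator discussed in the abstract above, the following is a minimal sketch of an importance-sampling estimate of a log marginal likelihood. It is not the paper's code: the three-state toy model, the proposal distribution, and every name below (`log_marginal_is`, `PRIOR`, `LIKELIHOOD`, `PROPOSAL`) are assumptions chosen so the exact answer can be computed by hand.

```python
import math
import random


def log_marginal_is(log_joint, sample_q, log_q, num_samples, rng):
    """Importance-sampling estimate of log p(x) = log sum_z p(x, z).

    Draws z ~ q and averages the weights p(x, z) / q(z) in log space.
    """
    log_weights = []
    for _ in range(num_samples):
        z = sample_q(rng)
        log_weights.append(log_joint(z) - log_q(z))
    # log-mean-exp keeps the average numerically stable
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights) / num_samples)


# Hypothetical toy model: one fixed observation x and a latent z in {0, 1, 2}.
PRIOR = [0.5, 0.3, 0.2]          # p(z)
LIKELIHOOD = [0.10, 0.40, 0.70]  # p(x | z) for the fixed x
PROPOSAL = [0.2, 0.3, 0.5]       # q(z), an arbitrary proposal distribution


def log_joint(z):
    return math.log(PRIOR[z] * LIKELIHOOD[z])


def log_q(z):
    return math.log(PROPOSAL[z])


def sample_q(rng):
    return rng.choices(range(3), weights=PROPOSAL)[0]


# The latent space is tiny here, so the exact marginal is available for comparison.
exact = math.log(sum(p * l for p, l in zip(PRIOR, LIKELIHOOD)))
estimate = log_marginal_is(log_joint, sample_q, log_q, 20000, random.Random(0))
print(f"exact log p(x) = {exact:.4f}, importance-sampling estimate = {estimate:.4f}")
```

Rerunning the sketch with a smaller `num_samples` or a different `PROPOSAL` shows how both choices move the reported estimate.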

Detecting Conversation Topics in Primary Care Office Visits from Transcripts of Patient-Provider Interactions

Journal of the American Medical Informatics Association, Volume 26, Issue 12, December 2019

Jihyun Park - Dimitrios Kotzias - Patty Kuo - Robert L. Logan IV - Kritzia Merced - Sameer Singh - Michael Tanana - Efi Karra Taniskidou - Jennifer Elston Lafata - David C Atkins - Ming Tai-Seale - Zac E Imel - Padhraic Smyth

Amid electronic health records, laboratory tests, and other technology, office-based patient and provider communication is still the heart of primary medical care. Patients typically present multiple complaints, requiring physicians to decide how to balance competing demands. How this time is allocated has implications for patient satisfaction, payments, and quality of care. We investigate the effectiveness of machine learning methods for automated annotation of medical topics in patient-provider dialog transcripts. We used dialog transcripts from 279 primary care visits to predict talk-turn topic labels. Different machine learning models were trained to operate on single or multiple local talk-turns (logistic classifiers, support vector machines, gated recurrent units) as well as sequential models that integrate information across talk-turn sequences (conditional random fields, hidden Markov models, and hierarchical gated recurrent units). Evaluation was performed using cross-validation to measure 1) classification accuracy for talk-turns and 2) precision, recall, and F1 scores at the visit level. Experimental results showed that sequential models had higher classification accuracy at the talk-turn level and higher precision at the visit level. Independent models had higher recall scores at the visit level compared with sequential models. Incorporating sequential information across talk-turns improves the accuracy of topic prediction in patient-provider dialog by smoothing out noisy information from talk-turns. Although the results are promising, more advanced prediction techniques and larger labeled datasets will likely be required to achieve prediction performance appropriate for real-world clinical applications.

Knowledge Enhanced Contextual Word Representations

EMNLP 2019

Matthew E. Peters - Mark Neumann - Robert L. Logan IV - Roy Schwartz - Vidur Joshi - Sameer Singh - Noah A. Smith

Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real-world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large-scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, improved ability to recall facts as measured in a probing task, and improved downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert’s runtime is comparable to BERT’s and it scales to large KBs.

Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

ACL 2019

Robert L. Logan IV - Nelson F. Liu - Matthew E. Peters - Matt Gardner - Sameer Singh

Modeling human language requires the ability to not only generate fluent text but also encode factual knowledge. However, traditional language models are only capable of remembering facts seen at training time, and often have difficulty recalling them. To address this, we introduce the knowledge graph language model (KGLM), a neural language model with mechanisms for selecting and copying facts from a knowledge graph that are relevant to the context. These mechanisms enable the model to render information it has never seen before, as well as generate out-of-vocabulary tokens. We also introduce the Linked WikiText-2 dataset, a corpus of annotated text aligned to the Wikidata knowledge graph whose contents (roughly) match the popular WikiText-2 benchmark. In experiments, we demonstrate that the KGLM achieves significantly better performance than a strong baseline language model. We additionally compare different language models' ability to complete sentences requiring factual knowledge, showing that the KGLM outperforms even very large language models in generating facts.

Bayesian Evaluation of Black-Box Classifiers

Workshop on Uncertainty & Robustness in Deep Learning @ ICML 2019

Disi Ji* - Robert L. Logan IV* - Padhraic Smyth - Mark Steyvers

There is an increasing need for accurate quantitative assessment of the performance of prediction models (such as deep neural networks), out-of-sample, e.g., in new environments after they have been trained. In this context we propose a Bayesian framework for assessing performance characteristics of black-box classifiers, performing inference on quantities such as accuracy and calibration bias. We demonstrate the approach using three deep neural networks applied to large real-world data sets, performing inference and active learning to assess class-specific performance.
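
To make the idea of Bayesian inference over accuracy concrete, here is a minimal sketch using a Beta-Bernoulli model. This is an assumption chosen for illustration, not necessarily the model used in the paper; the uniform Beta(1, 1) prior, the counts (42 correct out of 50 labeled examples), and the function names are all hypothetical.

```python
import random


def accuracy_posterior(correct, total, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for accuracy after observing labeled predictions.

    Assumes each prediction is an independent Bernoulli trial and a
    Beta(prior_a, prior_b) prior on the unknown accuracy.
    """
    return prior_a + correct, prior_b + (total - correct)


def credible_interval(a, b, level=0.95, num_draws=20000, seed=0):
    """Monte Carlo equal-tailed credible interval for a Beta(a, b) posterior."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(num_draws))
    lo_idx = int((1 - level) / 2 * num_draws)
    hi_idx = int((1 + level) / 2 * num_draws) - 1
    return draws[lo_idx], draws[hi_idx]


# Hypothetical counts: a black-box classifier got 42 of 50 labeled examples right.
a, b = accuracy_posterior(correct=42, total=50)
posterior_mean = a / (a + b)
low, high = credible_interval(a, b)
print(f"posterior mean accuracy = {posterior_mean:.3f}, "
      f"95% interval = ({low:.3f}, {high:.3f})")
```

The width of the interval is what an active labeling strategy could use to decide where additional labels are most informative.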

PoMo: Generating Entity-Specific Post-Modifiers in Context

NAACL 2019

Jun Seok Kang - Robert L. Logan IV - Zewei Chu - Yang Chen - Dheeru Dua - Kevin Gimpel - Sameer Singh - Niranjan Balasubramanian

We introduce entity post-modifier generation as an instance of a collaborative writing task. Given a sentence about a target entity, the task is to automatically generate a post-modifier phrase that provides contextually relevant information about the entity. For example, for the sentence, "Barack Obama, _______, supported the #MeToo movement.", the phrase "a father of two girls" is a contextually relevant post-modifier. To this end, we build PoMo, a post-modifier dataset created automatically from news articles reflecting a journalistic need for incorporating entity information that is relevant to a particular news event. PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities. We use crowdsourcing to show that modeling contextual relevance is necessary for accurate post-modifier generation. We adapt a number of existing generation approaches as baselines for this dataset. Our results show there is large room for improvement in terms of both identifying relevant facts to include (knowing which claims are relevant gives a >20% improvement in BLEU score), and generating appropriate post-modifier text for the context (providing relevant claims is not sufficient for accurate generation). We conduct an error analysis that suggests promising directions for future research.

Multimodal Attribute Extraction

AKBC Workshop @ NeurIPS 2017

Robert L. Logan IV - Samuel Humeau - Sameer Singh

The broad goal of information extraction is to derive structured information from unstructured data. However, most existing methods focus solely on text, ignoring other types of unstructured data such as images, video, and audio, which comprise an increasing portion of the information on the web. To address this shortcoming, we propose the task of multimodal attribute extraction. Given a collection of unstructured and semi-structured contextual information about an entity (such as a textual description or visual depictions), the task is to extract the entity's underlying attributes. In this paper, we provide a dataset containing mixed-media data for over 2 million product items along with 7 million attribute-value pairs describing the items, which can be used to train attribute extractors in a weakly supervised manner. We provide a variety of baselines which demonstrate the relative effectiveness of the individual modes of information towards solving the task, and we also study human performance.

Robert L. Logan IV

Ph.D. Student at UC Irvine