MMA

Benchmarking Multi-Modal Large Language Models in Ambiguity Contexts

Ru Wang*¹, Kexin Song*¹, Liang Ding², Mingming Gong³, Yusuke Iwasawa¹, Yutaka Matsuo¹, Jiaxian Guo¹

¹The University of Tokyo, ²The University of Sydney, ³The University of Melbourne

*Equal Contribution
†Corresponding to: ru.wang, kexin.song, jiaxian.guo

Introduction

Multi-Modal Large Language Models (MLLMs) have recently demonstrated strong capabilities in both instruction comprehension and generation, positioning them as promising tools for human-computer interaction. However, the inherent ambiguity of language poses a challenge: the same text can be interpreted differently in different contexts, which may lead models astray when carrying out a task. In multi-modal settings, visual information serves as a natural aid for disambiguation. In this paper, we introduce the first benchmark specifically designed to evaluate the performance of MLLMs in ambiguous contexts (MMA). The benchmark uses a multiple-choice visual question-answering format and includes 261 questions with ambiguous meaning. Each question is linked to a pair of images that suggest divergent scenarios, so the same question has a different correct answer for each image. The questions are stratified into three categories of ambiguity (lexical, syntactic, and semantic) to allow a detailed examination of MLLM performance across ambiguity types. By evaluating 16 proprietary and open-source MLLMs, we find that: (1) When presented with two different contextual images and asked the same question, MLLMs answer both correctly only 50.59% of the time, compared to 88.97% for humans. This is because MLLMs often overlook the scenario-specific information provided by the images. (2) Among the three types of ambiguity, models perform best on lexical ambiguity and worst on syntactic ambiguity. (3) Open-source models generally score significantly lower than proprietary MLLMs, with an average performance gap of 10.66%. GPT-4o emerges as the top model, achieving 70.00% accuracy. These results indicate that current MLLMs struggle with multi-modal ambiguity.

Overview

To systematically explore how well MLLMs perceive and resolve ambiguities of varying complexity, we categorize ambiguities into lexical, syntactic, and semantic types based on their linguistic characteristics. The benchmark tasks are structured as multiple-choice VQA scenarios, a format that simplifies evaluation. Each question is ambiguous in meaning and is associated with multiple images that provide different contexts, so the same question has different correct answers depending on the visual information provided. This design forces MLLMs to integrate and interpret both textual and visual data in order to select the correct answer, reflecting both the potential and the challenges of deploying such models in diverse, ambiguity-filled environments.

Illustration of benchmark samples. Each sample consists of a pair of images associated with the same question. The model needs to answer the question based on the visual information presented in each image.

Ambiguity Accuracy (Amb_A). This metric is the percentage of questions for which the model answers correctly for both paired images. A high Amb_A indicates that the model does not simply latch onto one possible interpretation of the ambiguity. Instead, it integrates the visual information from each image to arrive at the appropriate answer for each scenario.
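The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's official scoring script: the record layout `(question_id, predicted, gold)` and the function name `ambiguity_accuracy` are assumptions made for this example, and it assumes every image of a question has exactly one record.

```python
from collections import defaultdict


def ambiguity_accuracy(records):
    """Fraction of questions answered correctly for every paired image.

    `records` is an iterable of (question_id, predicted, gold) tuples,
    one tuple per image. This layout is hypothetical, chosen for the sketch.
    """
    per_question = defaultdict(list)
    for question_id, predicted, gold in records:
        per_question[question_id].append(predicted == gold)
    if not per_question:
        return 0.0
    # A question counts only if the answers for all of its images are correct.
    fully_correct = sum(all(flags) for flags in per_question.values())
    return fully_correct / len(per_question)


# Toy data: four questions, two images each.
records = [
    ("q1", "A", "A"), ("q1", "B", "B"),  # both images correct
    ("q2", "A", "A"), ("q2", "A", "C"),  # same answer given for both images
    ("q3", "D", "B"), ("q3", "C", "C"),  # one image correct
    ("q4", "B", "B"), ("q4", "D", "D"),  # both images correct
]
amb_a = ambiguity_accuracy(records)
print(f"Amb_A = {amb_a:.2%}")
```

In the toy data, question `q2` shows the failure mode the metric is meant to expose: a model that gives the same answer for both images gets one image right but earns no credit for the question.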

Three Types of Ambiguities

Lexical ambiguities. Lexical ambiguity arises mainly from polysemy, where a word in the sentence has more than one meaning. We consider ambiguity caused by nouns, adjectives, and verbs. The verb category covers both polysemy and the different emotions a verb may evoke.
Syntactic ambiguities. Syntactic ambiguities occur when sentence structures allow for multiple interpretations. There are three main types: (a) Attachment Ambiguity: This occurs when a modifying phrase, usually a prepositional phrase or clause, can logically attach to more than one part of the sentence. (b) Coordination Ambiguity: This happens when adjectives, adverbs, or other modifiers can ambiguously apply to one or more nouns in a series, creating uncertainty about whether the modifiers apply to all or just some elements. (c) Structural Ambiguity: This arises when verbs can be used in both transitive and intransitive forms, leading to different meanings.
Semantic ambiguities. Semantic ambiguities involve the broader meanings of text and their interaction with visual elements: (a) Idiomatic Ambiguity: This occurs with idiomatic expressions that can be interpreted both literally and metaphorically. (b) Pragmatic Ambiguity: This arises from interpreting a sentence in different contexts provided by visual cues, affecting how the listener or viewer understands the relevance and expected response. (c) Rhetorical Ambiguity: This involves the use of rhetorical devices like irony, sarcasm, or hyperbole, which can lead to multiple interpretations depending on the visual context.

Leaderboard

We evaluate 16 proprietary and open-source MLLMs.


Model	Adjective (30)	Noun (238)	Verb (16)	Attachment (24)	Coordination (46)	Structural (14)	Pragmatic (132)	Idiom (22)	Lexical (284)	Syntactic (84)	Semantic (154)	Overall (522)
Human Best	0.933	0.966	1.000	1.000	0.864	1.000	0.833	1.000	0.965	0.929	0.857	0.927
Human Average	0.827	0.931	0.825	1.000	0.900	0.629	0.824	0.982	0.914	0.886	0.847	0.890
GPT-4o	0.800	0.822	0.875	0.077	0.409	0.429	0.650	0.730	0.823	0.310	0.688	0.700
InternVL-Chat-V1-5	0.800	0.832	0.625	0.385	0.545	0.143	0.700	0.541	0.817	0.429	0.623	0.697
Gemini 1.5 Pro	0.786	0.755	0.833	0.538	0.591	0.143	0.737	0.382	0.762	0.500	0.569	0.660
GPT-4 Vision	0.867	0.748	0.625	0.231	0.409	0.286	0.675	0.622	0.754	0.333	0.649	0.655
VILA1.5-40b	0.733	0.807	0.625	0.231	0.545	0.000	0.600	0.378	0.789	0.357	0.494	0.632
LLaVA-NeXT-34B	0.867	0.798	0.500	0.077	0.591	0.000	0.400	0.405	0.789	0.333	0.403	0.602
HPT 1.5 Air	0.800	0.756	0.250	0.231	0.227	0.000	0.525	0.595	0.732	0.190	0.558	0.594
DeepSeek-VL	0.467	0.697	0.500	0.231	0.273	0.000	0.525	0.378	0.662	0.214	0.455	0.529
Gemini 1.0 Pro Vision	0.692	0.684	0.400	0.000	0.318	0.000	0.405	0.286	0.674	0.167	0.347	0.492
VILA1.5-13b	0.400	0.697	0.125	0.000	0.136	0.143	0.375	0.486	0.634	0.095	0.429	0.487
Yi-VL-34b	0.733	0.630	0.250	0.077	0.136	0.000	0.450	0.243	0.620	0.095	0.351	0.456
Cogvlm2	0.333	0.574	0.125	0.000	0.364	0.000	0.375	0.432	0.522	0.190	0.403	0.432
Claude 3 Opus	0.733	0.561	0.375	0.000	0.158	0.000	0.250	0.162	0.569	0.077	0.208	0.378
VILA1.5-3b	0.133	0.185	0.250	0.077	0.091	0.143	0.175	0.081	0.183	0.095	0.130	0.153
MiniCPM-Llama3-V 2.5	0.000	0.118	0.250	0.154	0.136	0.000	0.225	0.054	0.113	0.119	0.143	0.123
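
As a sanity check on how the columns relate, the Overall score of several rows can be reproduced, to within rounding, as the mean of the three category scores weighted by the category sizes in the column headers (284 + 84 + 154 = 522). The sketch below is an observation about the published numbers, not a documented definition of the Overall column, and not every row reproduces this closely: Gemini 1.5 Pro and Claude 3 Opus, for example, differ by a few thousandths. The names `COUNTS`, `ROWS`, and `weighted_overall` are introduced here for illustration.

```python
# Category sizes as given in the leaderboard column headers.
COUNTS = {"lexical": 284, "syntactic": 84, "semantic": 154}

# (lexical, syntactic, semantic, reported overall), copied from the table.
ROWS = {
    "Human Average": (0.914, 0.886, 0.847, 0.890),
    "GPT-4o": (0.823, 0.310, 0.688, 0.700),
    "InternVL-Chat-V1-5": (0.817, 0.429, 0.623, 0.697),
}


def weighted_overall(lexical, syntactic, semantic):
    """Count-weighted mean of the three category scores."""
    total = sum(COUNTS.values())
    weighted_sum = (
        lexical * COUNTS["lexical"]
        + syntactic * COUNTS["syntactic"]
        + semantic * COUNTS["semantic"]
    )
    return weighted_sum / total


# Absolute gap between the recomputed and the reported Overall score.
deviations = {
    name: abs(weighted_overall(*row[:3]) - row[3]) for name, row in ROWS.items()
}
for name, gap in deviations.items():
    print(f"{name}: |recomputed - reported| = {gap:.4f}")
```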
