On Evaluating Adversarial Robustness of
Large Vision-Language Models

Yunqing Zhao1*, Tianyu Pang2*☨, Chao Du2☨, Xiao Yang3, Chongxuan Li4,
Ngai-Man Cheung1☨, Min Lin2
*Equal Contribution, ☨Equal Advice
1Singapore University of Technology and Design
2Sea AI Lab, Singapore
3Tsinghua University     4Renmin University of China


Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision).

To this end, we propose evaluating the robustness of open-source large VLMs in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses. In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP, and then transfer these adversarial examples to other VLMs such as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we observe that black-box queries on these VLMs can further improve the effectiveness of targeted evasion, resulting in a surprisingly high success rate for generating targeted responses.
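The transfer step described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: a fixed random linear map stands in for a real surrogate encoder such as CLIP or BLIP, the objective (dot product between the adversarial image's features and a target feature vector) is an assumed feature-matching loss, and the names `embed` and `targeted_pgd` as well as the values of `eps`, `step`, and `iters` are hypothetical choices made for illustration.

```python
import random

def embed(weights, image):
    """Toy surrogate encoder: a fixed linear map from pixels to features.
    Assumption: stands in for a pretrained image encoder."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def targeted_pgd(weights, clean, target_emb, eps=8 / 255, step=1 / 255, iters=20):
    """Maximize <embed(adv), target_emb> subject to ||adv - clean||_inf <= eps."""
    # For a linear encoder the gradient w.r.t. the image is constant: W^T t.
    grad = [sum(weights[i][j] * target_emb[i] for i in range(len(weights)))
            for j in range(len(clean))]
    adv = list(clean)
    for _ in range(iters):
        for j, g in enumerate(grad):
            moved = adv[j] + step * ((g > 0) - (g < 0))              # signed ascent step
            moved = min(max(moved, clean[j] - eps), clean[j] + eps)  # project into eps-ball
            adv[j] = min(max(moved, 0.0), 1.0)                       # keep a valid pixel range
    return adv

# Hypothetical toy data: 12 "pixels", 4 feature dimensions.
rng = random.Random(0)
n_pixels, n_features = 12, 4
weights = [[rng.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(n_features)]
clean = [rng.uniform(0.2, 0.8) for _ in range(n_pixels)]
target_image = [rng.uniform(0.2, 0.8) for _ in range(n_pixels)]
target_emb = embed(weights, target_image)

adv = targeted_pgd(weights, clean, target_emb)
before = dot(embed(weights, clean), target_emb)
after = dot(embed(weights, adv), target_emb)
print(f"similarity to target: {before:.3f} -> {after:.3f}")
```

With a real surrogate, the constant gradient would be replaced by a backward pass through the encoder at every step; the perturbed image would then be sent unchanged to the victim VLM to test whether the attack transfers.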

Overall, our investigation and findings in this work provide a quantitative understanding of the adversarial vulnerability of large VLMs and call for a more thorough examination of their potential security flaws before deployment in practice.

Overview of Proposed Method
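As a rough illustration of how black-box queries can refine an adversarial image, the sketch below estimates a gradient purely from score queries using random directions (a random gradient-free estimator) and takes signed steps inside an L-infinity budget. Treat every detail as an assumption: the page does not specify the estimator, `toy_score` is a hypothetical stand-in for "similarity between the victim model's response and the targeted text", and `estimate_gradient`, `query_attack`, and all numeric settings are illustrative.

```python
import random

def estimate_gradient(score_fn, image, num_queries=20, sigma=0.01, rng=random):
    """Estimate d(score)/d(image) from queries only, via random directions."""
    base = score_fn(image)
    grad = [0.0] * len(image)
    for _ in range(num_queries):
        direction = [rng.gauss(0.0, 1.0) for _ in image]
        probe = [x + sigma * d for x, d in zip(image, direction)]
        delta = (score_fn(probe) - base) / sigma  # finite difference along the direction
        for j, d in enumerate(direction):
            grad[j] += delta * d / num_queries
    return grad

def query_attack(score_fn, clean, eps=8 / 255, step=1 / 255, iters=30, seed=0):
    """Signed ascent on the estimated gradient, projected into the eps-ball."""
    rng = random.Random(seed)
    adv = list(clean)
    for _ in range(iters):
        grad = estimate_gradient(score_fn, adv, rng=rng)
        for j, g in enumerate(grad):
            moved = adv[j] + step * ((g > 0) - (g < 0))
            moved = min(max(moved, clean[j] - eps), clean[j] + eps)
            adv[j] = min(max(moved, 0.0), 1.0)
    return adv

# Hypothetical black box: the attacker sees only the returned score.
hidden = [0.5, -1.0, 0.25, -0.75, 1.0, -0.5, 0.75, -0.25]

def toy_score(image):
    return sum(w * x for w, x in zip(hidden, image))

clean = [0.5] * len(hidden)
adv = query_attack(toy_score, clean)
print(f"score: {toy_score(clean):.3f} -> {toy_score(adv):.3f}")
```

Each iteration costs `num_queries + 1` calls to the black box, so the query budget trades off directly against the quality of the gradient estimate. In a combined pipeline, `clean` would be replaced as the starting point by an image produced in the transfer step.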

Experiment Results

Visual question-answering (VQA) task implemented by MiniGPT-4. MiniGPT-4 has capabilities for vision-language understanding and performs comparably to GPT-4 on tasks such as multi-round VQA by leveraging the knowledge of large LMs. We select images with refined details generated by Midjourney [48] and feed questions (e.g., Can you tell me what is the interesting point of this image?) into MiniGPT-4. As expected, MiniGPT-4 can return descriptions that are intuitively reasonable, and when we ask additional questions (e.g., But is this a common scene in the normal life?), MiniGPT-4 demonstrates the capacity for accurate multi-round conversation. Nevertheless, after being fed targeted adversarial images, MiniGPT-4 will return answers related to the targeted description (e.g., A robot is playing in the field). This adversarial effect can even affect multi-round conversations when we ask additional questions.

Joint generation task implemented by UniDiffuser. There are generative VLMs such as UniDiffuser that model the joint distribution of image-text pairs and are capable of both image-to-text and text-to-image generation. Consequently, given an original text description (e.g., An oil painting of a bridge in rains. Monet Style), the text-to-image direction of UniDiffuser is used to generate the corresponding clean image, and its image-to-text direction can recover a text response (e.g., A painting of a bridge at night by Monet) similar to the original text description. This round trip between the image and text modalities is consistent on clean images. When a targeted adversarial perturbation is added to a clean image, however, the image-to-text direction of UniDiffuser will return a text (e.g., A small white dog sitting in the grass near a stream in Autumn) that semantically resembles the predefined targeted description (e.g., A small white dog sitting on the ground in autumn leaves), thereby affecting every subsequent step in the recovery chain.

Image captioning task implemented by BLIP-2. Given an original text description, DALL-E/Midjourney/Stable Diffusion is used to generate corresponding clean images. Note that real images can also serve as clean images. BLIP-2 accurately returns captioning text (e.g., A field with yellow flowers and a sky full of clouds) that is analogous to the original text description / the content of the clean image. After the clean image is maliciously perturbed by targeted adversarial noise, the adversarial image can mislead BLIP-2 into returning a caption (e.g., A cartoon drawn on the side of an old computer) that semantically resembles the predefined targeted response (e.g., A computer from the 90s in the style of vaporwave).
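One way to make "semantically resembles" measurable is to score each returned caption against the targeted response. The sketch below uses the three example captions from the paragraph above with a deliberately crude bag-of-words cosine similarity; this metric is an assumption chosen to keep the example dependency-free, and a learned text encoder would be a more faithful choice in practice. The helper names `bag_of_words` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    shared = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return shared / norm if norm else 0.0

target = "A computer from the 90s in the style of vaporwave"
clean_caption = "A field with yellow flowers and a sky full of clouds"
adv_caption = "A cartoon drawn on the side of an old computer"

target_vec = bag_of_words(target)
clean_score = cosine(bag_of_words(clean_caption), target_vec)
adv_score = cosine(bag_of_words(adv_caption), target_vec)
print(f"clean caption vs target:       {clean_score:.3f}")
print(f"adversarial caption vs target: {adv_score:.3f}")
```

A higher score for the adversarial caption than for the clean caption indicates that the perturbation moved the model's output toward the targeted response. Word overlap is dominated by common words, which is why an embedding-based score would be preferable for real evaluation.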

Related Links

Many excellent recent works build large vision-language models, for example:

- LAVIS is a one-stop Library for Language-Vision Intelligence.

- MiniGPT-4 and LLaVA perform Visual Question-Answering on top of Large Language Models (LLMs).

- UniDiffuser achieves multi-modal joint generation using a single ViT.


      title={On Evaluating Adversarial Robustness of Large Vision-Language Models},
      author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Li, Chongxuan and Cheung, Ngai-Man and Lin, Min},
      booktitle={Thirty-seventh Conference on Neural Information Processing Systems},

Meanwhile, a related work that aims to embed a watermark into (multi-modal) diffusion models:

                title={A Recipe for Watermarking Diffusion Models},
                author={Zhao, Yunqing and Pang, Tianyu and Du, Chao and Yang, Xiao and Cheung, Ngai-Man and Lin, Min},
                journal={arXiv preprint arXiv:2303.10137},