|
Recent news (occasionally updated)
[2026-02] Our works on adaptive tool-integrated reasoning MLLMs, CodeDance and ThinkGen, were accepted by CVPR 2026!
[2026-01] Our work on generative models for normal estimation, RoSE, was accepted by ICLR 2026 as an oral presentation!
[2025-08] My PhD thesis won the Outstanding Thesis Award!
[2025-07] Our work on MoE-based Video LLMs, TimeExpert, was accepted by ICCV 2025!
[2024-07] I received my PhD degree from SUTD and joined TikTok / ByteDance Singapore as a Research Scientist.
|
|
Recent and Selected Publications
† denotes corresponding author(s); * denotes equal contribution; Full list of publications: [Google Scholar]
|
|
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Qi Song*,
Honglin Li*,
Yingchen Yu,
Haoyi Zhou,
Lin Yang,
Qi She,
Song Bai,
Zilong Huang†,
Yunqing Zhao†
Do we always need to integrate tools for visual reasoning?
While tool-augmented multimodal models (e.g., o-series models and Gemini) have demonstrated strong performance gains, indiscriminate tool invocation often results in unnecessary computation, instability, and reasoning inefficiency.
We introduce CodeDance, a scalable and dynamic tool-integrated MLLM that learns when and how to use tools, and critically, when tool invocation is unnecessary.
Through a difficulty-aware reinforcement learning objective (RBAT), the model balances exploration and efficiency, addressing the long-standing overuse/underuse pathology in tool-augmented systems.
Compared to prior tool-integrated approaches, CodeDance achieves strong improvements across counting, chart QA, and visual search/math benchmarks, while reducing reasoning turns and improving interaction efficiency.
CVPR 2026, Denver, Colorado, United States.
[Paper]
[Webpage]
[Code]
[SFT Dataset]
[RL Dataset]
|
|
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
Zuhao Yang,
Yingchen Yu,
Yunqing Zhao,
Shijian Lu†,
Song Bai
We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes Video Temporal Grounding (VTG) tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency.
Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
TimeExpert achieves state-of-the-art performance on various VTG tasks, including Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
ICCV 2025, Honolulu, Hawaii, United States.
[Paper]
[Webpage]
[Code]
|
|
A Recipe for Watermarking (Multimodal) Diffusion Models
Yunqing Zhao,
Tianyu Pang†,
Chao Du†,
Xiao Yang,
Ngai-Man Cheung†,
Min Lin
We conduct comprehensive analyses and derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable Diffusion), via training from scratch or fine-tuning.
Our recipe is straightforward but involves empirically ablated implementation details, providing a solid foundation for future research on watermarking DMs.
arXiv, 2023 & US Patent, 2024
[Paper]
[Webpage]
[Code]
|
|
Evaluating adversarial robustness of large vision-language models (VLMs)
Yunqing Zhao*,
Tianyu Pang*†,
Chao Du†,
Xiao Yang,
Chongxuan Li,
Ngai-Man Cheung†,
Min Lin
Large VLMs such as GPT-4 achieve unprecedented performance in response generation,
esp. with visual inputs, enabling more creative and adaptable interaction than LLMs like ChatGPT.
However, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision).
We evaluate the robustness of open-source large VLMs (e.g., MiniGPT-4, LLaVA, BLIP, UniDiffuser) in the most realistic and high-risk setting,
where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses.
NeurIPS 2023, New Orleans, Louisiana, United States.
[Paper]
[Webpage]
[Code]
|
|
Exploring incompatible knowledge transfer in few-shot image generation
Yunqing Zhao,
Chao Du,
Milad Abdollahzadeh,
Tianyu Pang,
Min Lin,
Shuicheng Yan,
Ngai-Man Cheung†
Through interpretable GAN dissection tools, we demonstrate that fine-tuning-based methods cannot effectively remove knowledge that is incompatible
with the target domain after adaptation (e.g., trees/buildings on the sea) in the few-shot image generation task.
We propose Remove In-Compatible Knowledge (RICK), an efficient and dynamic algorithm that estimates filter importance and prunes the filters that are incompatible
with the target domain.
CVPR 2023, Vancouver, British Columbia, Canada.
[Paper]
[Webpage]
[Code]
|
|
FS-BAN: Born-Again Networks for Domain Generalization Few-shot Classification
Yunqing Zhao,
Ngai-Man Cheung†
We propose a method to improve generalizability in the cross-domain few-shot classification problem using born-again networks.
Our algorithm requires no additional parameters or training data and can be applied readily to many existing FSC models.
The key insight is to distill the dark knowledge from a teacher model with additional multi-task objectives designed specifically for
cross-domain few-shot learning.
IEEE Trans. on Image Processing (TIP) 2023.
[Paper]
[Webpage]
[Code]
|
|
Few-shot image generation via adaptation-aware kernel modulation
Yunqing Zhao*,
Keshigeyan Chandrasegaran*,
Milad Abdollahzadeh*,
Ngai-Man Cheung†
When fine-tuning a pretrained image generator on few-shot target samples, we show that state-of-the-art algorithms perform no better
than a simple baseline method when the target samples are distant from the source domain.
We propose AdAM, a parameter-efficient and target-aware method to select source knowledge important for few-shot adaptation.
NeurIPS 2022, New Orleans, Louisiana, United States.
[Paper]
[Webpage]
[Code]
|
|
Revisiting Label Smoothing & Knowledge Distillation Compatibility: What was Missing?
Keshigeyan Chandrasegaran,
Ngoc-Trung Tran*,
Yunqing Zhao*,
Ngai-Man Cheung†
We investigate the compatibility between label smoothing (LS) and knowledge distillation (KD), i.e., to smooth or not to smooth a teacher network?
We discover, analyze, and validate systematic diffusion as the missing concept, which is instrumental in understanding and resolving the contradictory findings in prior works.
This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective.
ICML 2022, Baltimore, Maryland, United States.
[Paper]
[Webpage]
[Code]
|
|
A Closer Look at Few-shot Image Generation
Yunqing Zhao,
Henghui Ding,
Houjing Huang,
Ngai-Man Cheung†
We analyze existing few-shot image generation algorithms in a unified testbed and find that
diversity degradation is the major issue during few-shot target adaptation.
Our proposed mutual-information-based algorithm alleviates this issue and achieves state-of-the-art performance
on few-shot image generation tasks.
CVPR 2022, New Orleans, Louisiana, United States.
[Paper]
[Webpage]
[Code]
|
|
Workshop & Challenge
|
|
Explanation-guided Training for Cross-domain Few-shot Classification
Jiamei Sun, Sebastian Lapuschkin, Wojciech Samek, Yunqing Zhao,
Ngai-Man Cheung,
Alexander Binder†
ICML-2020 Workshop & ICPR-2020.
[Paper]
[Code]
|
|
CIKM-2020 Alibaba-Tsinghua Adversarial Challenge on Object Detection
Honglin Li,
Yunqing Zhao
Ranked 10th out of 1,814 teams in the challenge
CIKM-2020 Workshop.
[Paper]
[Code]
|
|
Selected Research Experience
|
|
Microsoft Research - Asia
Research Intern
11.2023 - 02.2024,
Explored the applications and capabilities of LLMs for multimodal understanding and generation.
|
|
Sea AI Lab, Singapore
Research Intern
09.2022 - 11.2023,
Worked with Chao Du and Tianyu Pang
Advised by Prof. Shuicheng Yan and Min Lin
|
|
TikTok / ByteDance AI Lab, Singapore
Research Intern
08.2021 - 08.2022,
Worked with Henghui Ding (now at Fudan University, Shanghai) and Houjing Huang (now at UZH, Switzerland)
|
|
ST Engineering - SUTD Cyber Security Lab, Singapore
Student Researcher
08.2020 - 07.2021,
Advised by Prof. Ngai-Man Cheung
|
|
University of Hong Kong
Research Assistant
Spent wonderful days in SouthLane & Pok Fu Lam Road
11.2018 - 04.2019,
Advised by Dr. Vincent Tam
|
|
Teaching & Service
|
|
Active Reviewer of
NeurIPS, CVPR, TPAMI, TIP, TIFS, TNNLS, TASL, TMM, TCSVT, CVIU, etc.
Graduate Teaching Assistant of
50.021 Artificial Intelligence
and
50.035 Computer Vision @ SUTD
|
|