ProGuard: Towards Proactive Multimodal Safeguard

1Shanghai Artificial Intelligence Laboratory, 2PRLab, Nanjing University, 3Beihang University
*Equal Contribution, Corresponding Author
Introduction

Comparison of moderation workflows between the Reactive Guard (left) and the Proactive Guard (right). After judging whether an input is safe, a reactive guard can only perform multi-class classification of safety risks against a provided static taxonomy. In contrast, a proactive guard first determines whether a detected safety risk belongs to a known category in the static taxonomy; if it does not, the guard infers a reasonable name for the new safety category. A proactive guard therefore not only understands safety policies but also generates reasonable descriptions for novel safety categories.
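To make the contrast concrete, the sketch below shows the two decision flows in minimal Python. Everything here is illustrative: the keyword-based stubs stand in for the actual guard model, and the taxonomy entries, function names, and inferred category string are our own assumptions, not ProGuard's interface.

# Illustrative sketch of the two moderation workflows; the classifiers below
# are trivial keyword stubs standing in for the actual guard model, and the
# taxonomy is a hypothetical fragment, not the paper's real one.

STATIC_TAXONOMY = {
    "violence": ["attack", "weapon"],
    "privacy": ["address", "passport"],
}

def is_unsafe(text: str) -> bool:
    """Stub binary safety check (the real guard is a vision-language model)."""
    known = any(kw in text for kws in STATIC_TAXONOMY.values() for kw in kws)
    return known or "deepfake" in text

def match_taxonomy(text: str):
    """Return the first known category whose keywords appear, else None."""
    for category, keywords in STATIC_TAXONOMY.items():
        if any(kw in text for kw in keywords):
            return category
    return None

def reactive_guard(text: str):
    """Reactive: after the safety check, force a label from the static taxonomy."""
    if not is_unsafe(text):
        return ("safe", None)
    return ("unsafe", match_taxonomy(text) or "other")  # novel risks collapse to 'other'

def proactive_guard(text: str):
    """Proactive: check taxonomy membership first; name the category if unseen."""
    if not is_unsafe(text):
        return ("safe", None)
    category = match_taxonomy(text)
    if category is not None:
        return ("unsafe", category)
    # Out-of-taxonomy risk: the real model generates a concise description here.
    return ("unsafe", "synthetic identity deepfake (inferred)")

print(reactive_guard("how to build a weapon"))   # ('unsafe', 'violence')
print(reactive_guard("make a deepfake of me"))   # ('unsafe', 'other')
print(proactive_guard("make a deepfake of me"))  # ('unsafe', 'synthetic identity deepfake (inferred)')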

Abstract

The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the model adjustments that traditional reactive approaches require. We first construct a modality-balanced dataset of 87K samples, each annotated with both a binary safety label and a risk category under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers strong proactive moderation, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
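The synonym-bank-based similarity reward can be pictured with a minimal sketch like the one below. It is an assumption-laden illustration: the real system presumably scores generated category names with learned sentence embeddings, whereas this stand-in uses bag-of-words cosine similarity plus a simple conciseness cutoff so it runs without any model; the synonym-bank entry and function names are hypothetical.

# Minimal sketch of a synonym-bank similarity reward. Bag-of-words cosine
# similarity stands in for a learned sentence encoder; the bank entry below
# is a hypothetical example for one held-out unsafe category.
import math
from collections import Counter

SYNONYM_BANK = {
    "financial fraud": ["investment scam", "phishing scheme", "money fraud"],
}

def embed(text: str) -> Counter:
    """Stand-in embedding: token counts (swap in a sentence encoder in practice)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_reward(prediction: str, gold_category: str, max_words: int = 5) -> float:
    """Max similarity to any synonym of the gold category, zeroed if verbose."""
    if len(prediction.split()) > max_words:   # encourage concise category names
        return 0.0
    bank = SYNONYM_BANK[gold_category] + [gold_category]
    return max(cosine(embed(prediction), embed(s)) for s in bank)

print(round(similarity_reward("online investment scam", "financial fraud"), 2))  # 0.82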

ProGuard Overview

Overview of ProGuard. The framework combines the moderation task design, a multimodal safety taxonomy paired with a balanced safety dataset, and an online reinforcement learning pipeline to build a reasoning-enhanced proactive guard model.
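As a rough illustration of the dataset side of this pipeline, one annotated sample might be represented as follows. The field names and taxonomy path are hypothetical, inferred only from the abstract's description (a binary safety label plus a hierarchical risk category, balanced across three input modalities).

# Hypothetical schema for one sample of the modality-balanced safety dataset;
# field names and the taxonomy path are assumptions, not the released format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SafetySample:
    modality: str              # "text" | "image" | "text-image"
    text: Optional[str]        # None for image-only samples
    image_path: Optional[str]  # None for text-only samples
    is_safe: bool              # binary safety label
    category: Optional[str]    # leaf of the hierarchical taxonomy; None if safe

sample = SafetySample(
    modality="text-image",
    text="Caption paired with the image below",
    image_path="images/000123.png",
    is_safe=False,
    category="illegal activity/financial fraud",
)
print(sample.is_safe, sample.category)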

Table 1

On the Binary Safety Classification task, results measured using F1-score show that ProGuard achieves performance comparable to closed-source large-model APIs while using fewer parameters.

Table 2

On the Unsafe Content Categorization task, results measured using accuracy show that ProGuard outperforms all open-source guard models, though a performance gap remains relative to closed-source APIs.

Table 3

On the OOD safety category inference task, models must first determine whether a dialogue belongs to any category in the static safety taxonomy. Results measured using F1-score show that ProGuard outperforms closed-source large models on in/out-of-taxonomy classification.

Table 4

On the OOD safety category inference task, models must infer a safety category when the input falls outside the taxonomy. Evaluation using the proposed similarity-based reward shows that ProGuard's performance approaches that of closed-source large models.

BibTeX

@article{yu2025proguard,
  title={ProGuard: Towards Proactive Multimodal Safeguard},
  author={Yu, Shaohan and Li, Lijun and Si, Chenyang and Sheng, Lu and Shao, Jing},
  journal={arXiv preprint arXiv:2512.23573},
  year={2025},
  url={https://yushaohan.github.io/ProGuard/}
}