OpenAI Releases Open-Weight Safety Models for Custom Online Moderation
OpenAI has launched two new open-weight reasoning models, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, designed to let developers apply custom safety policies dynamically at inference time rather than embedding fixed policies during training. The models, fine-tuned from the gpt-oss base and licensed under Apache 2.0, let platforms rapidly adapt safety measures to emerging risks such as fraud, self-harm, or game-specific abuse: the model interprets developer-provided policy rules and returns an auditable chain of reasoning alongside its verdict. The 120B-parameter model fits on a single 80GB GPU, while the 20B model runs on smaller GPUs with 16GB of VRAM, making them accessible across varied deployment environments.

OpenAI says the models outperform GPT-5 on internal multi-policy benchmarks and form part of a broader "defense-in-depth" safety strategy it already uses internally, where up to 16% of compute resources can be dedicated to safety reasoning. The release was made in collaboration with partners including ROOST, Discord, and SafetyKit, with ROOST also launching a community hub to support open-source safety infrastructure for smaller platforms. The move offers developers transparency and control, and supports rapid safety policy updates without retraining large models.
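The key pattern here is that the policy travels with each request as ordinary text instead of being baked into the weights. A minimal sketch of that workflow follows, assuming the 20B model is served locally behind an OpenAI-compatible API (for example via vLLM); the endpoint URL, model name string, and policy wording are illustrative assumptions, not details from OpenAI's announcement.

```python
# Minimal sketch: classify content against a developer-supplied policy at
# inference time. Assumes gpt-oss-safeguard-20b is running behind an
# OpenAI-compatible endpoint (e.g. started with vLLM); the URL, model name,
# and policy text below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# The policy is plain text sent with each request, so it can be revised at
# any time without retraining or redeploying the model.
POLICY = """\
Policy: in-game trading fraud
- VIOLATION: offers to sell accounts or items for real money
- VIOLATION: requests for another player's login credentials
- ALLOWED: trades conducted entirely in in-game currency
Return a verdict (VIOLATION or ALLOWED) followed by your reasoning.
"""

def moderate(content: str) -> str:
    """Ask the safeguard model to judge `content` against POLICY."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
    )
    # The reply contains the verdict plus an auditable chain of reasoning.
    return response.choices[0].message.content

if __name__ == "__main__":
    print(moderate("Selling my maxed account, $50 via PayPal, DM me."))
```

Updating moderation behavior in this setup means editing the policy string, not fine-tuning the model, which is what enables the rapid policy iteration the release emphasizes.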
