Observed safety challenges and evaluations
In addition to advancing language model capabilities, the o1 family’s ability to reason in context provides new opportunities for improving the safety of these models. The o1 models are our most robust models to date, achieving substantial improvements on our hardest jailbreak evaluations. They are also better aligned with OpenAI policy, reaching state-of-the-art performance on our hardest internal benchmarks for evaluating adherence to our content guidelines.
The o1 model family represents a transition from relying solely on fast, intuitive thinking to also using slower, more deliberate reasoning. While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications. In this section, we outline the safety evaluations we conducted on this model, spanning harmfulness, jailbreak robustness, hallucinations, and bias evaluations. We then investigate risks involving the chain of thought itself, and describe our ongoing research on chain of thought deception monitoring. Finally, we detail the results of our external red teaming campaign.
Additionally, as part of our continued effort to partner with external experts, a set of pre-deployment evaluations was conducted on a version of the o1 model by the U.S. AI Safety Institute (US AISI) and the UK AI Safety Institute (UK AISI); those results are not included in this report.