Scope of testing
As part of our commitment to iterative deployment, we continuously refine and improve our models. The evaluations described in this System Card pertain to the full family of o1 models, and exact performance numbers for the model used in production may vary slightly depending on system updates, final parameters, system prompt, and other factors.
More concretely, for o1, evaluations on the following checkpoints [B] are included:
o1-near-final-checkpoint
o1-dec5-release
Between o1-near-final-checkpoint and the subsequent releases, the changes were incremental post-training improvements to format following and instruction following; the base model remained the same. We determined that prior frontier testing results remain applicable to the models with these improvements. Our safety evaluations, as well as the chain-of-thought safety and multilingual evaluations, were conducted on o1-dec5-release, while external red teaming and preparedness evaluations were conducted on o1-near-final-checkpoint. [C]
Last updated