
On Sept. 12, OpenAI released o1, its new family of large language models, which is better at logical reasoning than its predecessors.
This is because o1 has been taught to use chain-of-thought reasoning. Rather than immediately returning an answer, it first generates a logical chain of thought, noticing and correcting its own errors, much like a person thinking out loud. To keep this thought process a faithful record of the model's actual reasoning, OpenAI “cannot train any policy compliance or user preferences onto the chain of thought,” and therefore does not display the unaltered chain of thought to users. Instead, users see a summary of the chain of thought, and o1 is instructed “to reproduce any useful ideas from the chain of thought in the answer.” Because of this chain-of-thought process, the model may think for a couple of minutes before answering a complex question.
Compared with GPT-4o, OpenAI's previous flagship model, o1 is much improved at logical tasks such as programming, doing math, and taking the LSAT (the law school entrance exam), and somewhat improved at most other tasks, with the exception of stylish writing. On the International Olympiad in Informatics, the model scored in the 49th percentile when allowed 50 submissions per problem, and at the level of a gold medal when allowed 10,000 submissions per problem.
Two o1 models have been released to the public. According to OpenAI, “OpenAI o1-preview is the early version of this model, while OpenAI o1-mini is a faster version of this model that is particularly effective at coding.”
OpenAI performed various safety evaluations of o1, both internal and external. Apollo Research found that o1 could perform “scheming,” but that this scheming was legible and probably would not lead to catastrophic consequences. (AI models are theorized to be at risk of developing destructive strategies for achieving their goals.) METR “could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration,” and found that o1 could contribute to frontier AI research.
According to OpenAI, this model, being more thoughtful, is likely better at reward hacking: like Amelia Bedelia, the model finds an unexpected and possibly harmful way to solve an underspecified problem. During one capture-the-flag (CTF) challenge (a sanctioned hacking exercise), the challenge’s container failed to start correctly, leading o1 to scan the network to figure out what was going on. In doing so, it discovered that, due to a misconfiguration, it could simply start “a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container log.” Thus, the model used more hacking than its evaluators expected in order to fulfill its goal. Safety researchers find this sort of behavior concerning because it can be dangerous for AI models to pursue their goals through unexpected methods and interpretations.
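The failure mode here can be sketched in miniature. The following toy example (our own construction, not OpenAI's evaluation) shows how an agent that maximizes an underspecified proxy reward can earn a perfect score without doing the intended task:

```python
# Toy illustration of reward hacking. The intended goal is to sort a list;
# the (underspecified) proxy reward is the fraction of adjacent pairs that
# are in non-decreasing order. A "hacking" policy returns an empty list,
# trivially maximizing the proxy without doing any sorting.

def proxy_reward(xs):
    """Fraction of adjacent pairs in order; vacuously 1.0 if no pairs exist."""
    pairs = list(zip(xs, xs[1:]))
    if not pairs:
        return 1.0  # nothing left to be out of order
    return sum(a <= b for a, b in pairs) / len(pairs)

def honest_policy(xs):
    return sorted(xs)  # does the intended work

def hacking_policy(xs):
    return []          # exploits the gap between the reward and the goal

data = [3, 1, 2]
print(proxy_reward(honest_policy(data)))   # 1.0
print(proxy_reward(hacking_policy(data)))  # 1.0 -- same reward, no sorting
```

Both policies receive the maximum reward, even though only one accomplishes the goal; in the CTF incident, o1 similarly satisfied its objective (obtaining the flag) by a route its evaluators never intended.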
OpenAI’s o1 System Card also said that the model poses a “medium” risk of biological threat creation because of its ability to perform biological research and instruct laboratory workers. The model was also rated as having “medium” persuasion ability but “low” autonomy. Overall, the model was rated “medium risk,” with “risk” referring to the possibility of engaging in or assisting with destructive or nefarious activities.