Background
Many of the best and brightest minds in AI have said that current benchmarks are not fit for purpose and are losing their utility. Popular benchmarks such as MMLU, HumanEval, DROP, GSM8K, HellaSwag and others have recently been "saturated" by the most capable models, and many people are calling for better benchmarks to help propel the industry forward. A number of benchmarks show models crushing human performance even when it's obvious that model intelligence and reasoning capability isn't at human level across the board. Those with a vested interest in the progress of AI would like a better understanding of how model intelligence and reasoning capability is improving with each new model release. It would be incredibly helpful if we could more precisely gauge model progress relative to human-level performance.
In comes Philip of AI Explained to the potential rescue. Yesterday he released Simple Bench, a basic reasoning benchmark. He created it because he couldn't find a good enough reasoning benchmark whose questions, phrased in plain English, can be easily and correctly answered by normal people but still trip up current frontier models due to their limited reasoning capabilities.
Philip has previously gone into detail about the problems he's found within benchmarks such as MMLU.
Insight #1 - Anthropic has the edge
Notice that GPT-4 Turbo-Preview (which was recently replaced by the newer GPT-4o) sits in 2nd place whereas GPT-4o sits only in 7th place, a 10% gap in basic reasoning capability. In their May 13, 2024 blog post,
OpenAI markets GPT-4o by saying it
"matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models." Based on Philip's Simple Bench scores this doesn't appear to be the case. He says that he's almost certain that OpenAI made GPT-4o a lighter model with fewer parameters (making it cheaper and quicker to run). That better price-to-performance ratio though has trade-offs in that it appears to have lost some reasoning capability.
The exception to this paradigm (where newer models are cheaper and faster but less capable at reasoning) is Anthropic's latest models. Claude 3 Opus is the biggest model from Anthropic and sits in 3rd place, whereas Claude 3.5 Sonnet (which is not the biggest model and therefore faster and cheaper) sits in 1st place with slightly better reasoning performance. Philip says this is "unambiguous evidence that Anthropic has the secret sauce currently with LLMs" given how it is able to push speed, cost and intelligence all in the right directions without an apparent trade-off. The Simple Bench score that Claude 3.5 Opus receives when released will be very telling and may further cement Anthropic's innovation edge in the LLM space.
My key takeaway
If you're deciding to buy and roll out a chatbot within your organisation, or are beginning to invest in an LLM API to build products on, I'd pay particular attention to whether Anthropic does in fact continue to have the edge when Claude 3.5 Opus is released. I'd also keep tabs on OpenAI's next model release to see whether they reverse their current trend of sacrificing the model's reasoning capability for improvements in cost and speed.
Insight #2 - Models hallucinate but they also engage in sophistry
Current models can often pick up on important facts and even sometimes inform us of the importance of those facts in answering the question, but they aren't always able to link those facts together properly in order to come up with the correct answer. It's as if models can recall the facts but not always reason about them effectively. Take this example that Philip of AI Explained hand-crafted and then gave to various models to answer, observing the results:
Some of the results:
- Gemini 1.5 Pro is able to make the connection that "the bald man in the mirror is John" but then still gets the final answer wrong by saying that "John should apologise to the bald man", even though the bald man is John himself and one does not need to apologise to oneself
- Claude 3.5 Sonnet says "The key realization here is that the 'bald man' John sees in the mirror is actually John himself. He's looking at his own reflection, and the lightbulb has hit him on the head." This is a good result however Claude then decides to eliminate an answer saying "C) is incorrect because someone else (the bald man) did get hurt"
Philip says he
"sees these illogicalities all the time when testing Simple Bench" and goes on to say
"models have snippets of good reasoning but often can't piece them together.". What seems to be happening is that
"the right weights of the model are activated to at least say something about the situation but not factor it in or analyse it or think about it." Philip goes onto say that the paper
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting highlights that
"models favour fluency and coherency over factuality" and interprets this as
"models don't quite weigh information properly and therefore don't always have the right level of rationality that we'd expect."
My key takeaway
This isn't hallucination per se but another form of "sophistry" where the model can unintentionally deceive. The model confidently recalls true and relevant facts but then sometimes follows up by making blatant reasoning errors that a normal human probably wouldn't make. When building products in domains where accuracy and user trust are paramount, these reasoning errors will need to be mitigated.
Insight #3 - Slight variations in wording strongly affect performance
Philip explains that the above paper describes the scenario where "slight variations in the wording of the questions cause quite dramatic changes in performance across all models. Slight changes in wording triggers slightly different weights. It almost goes without saying that the more the models are truly reasoning the less the difference there should be in the wording of the question."
Given this sensitivity to specific wording, he goes on to describe the next potential paradigm with LLMs, where an LLM given a prompt can effectively rewrite it into something more optimal. The LLM can "actively search for a new prompt to use on itself, a sort of active inference in a way", which produces better results than the user's original prompt.
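To make the idea concrete, here's a minimal sketch of that "rewrite the prompt before answering" pattern. This is my own illustration rather than anything from Philip's video; `call_llm`, `REWRITE_INSTRUCTION` and `answer_with_self_rewrite` are hypothetical names standing in for whatever model API your product already uses.

```python
# Minimal sketch of a prompt self-rewrite step (hypothetical helpers, no specific provider).

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to your chosen model and return its reply."""
    raise NotImplementedError("wire this up to your LLM provider")

REWRITE_INSTRUCTION = (
    "Rewrite the following prompt so it is clearer and easier to answer correctly. "
    "Return only the rewritten prompt.\n\n{prompt}"
)

def answer_with_self_rewrite(user_prompt: str) -> str:
    # Step 1: ask the model to search for a better phrasing of the user's prompt.
    improved_prompt = call_llm(REWRITE_INSTRUCTION.format(prompt=user_prompt))
    # Step 2: answer the rewritten prompt rather than the original wording.
    return call_llm(improved_prompt)
```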
My key takeaway
- Iterating on prompts by changing the wording matters, and this should be built into the product development and improvement process your team operates within. Evals really help here as they allow for more systematic iteration on prompts, since they enable repeated performance evaluation across the range of outcomes (not just a select few) that matter most to the users of your product. The workflow for your team could be as follows (a minimal code sketch follows this list):
- Write a prompt
- Run the eval system and record its overall score
- Iterate on that prompt
- Re-run the eval system
- If the overall score improves, adopt the prompt change; if not, continue to iterate or move on to other work if you're unable to squeeze out more performance
- It's hard to know for sure whether AI labs will train future models to be more resilient to these wording variations. If Philip is right, this seems likely, given that for a model to "truly reason", differences in wording should not produce wildly different levels of performance. In the meantime, how should a product be built? There are a plethora of prompt rewrite libraries to choose from, but these feel more like a plaster than an actual solution.
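To make the eval workflow above concrete, here's a minimal sketch of that loop. It assumes you already maintain a small test set of inputs and expected outcomes; `call_llm`, `score_answer` and `run_eval` are hypothetical helpers for illustration, not an existing eval framework.

```python
# Minimal sketch of the prompt-iteration eval loop (hypothetical helpers).
from statistics import mean

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to your model and return its reply."""
    raise NotImplementedError("wire this up to your LLM provider")

def score_answer(model_answer: str, expected: str) -> float:
    """Hypothetical scorer: a simple substring check here, but swap in whatever
    definition of 'correct' matters most to the users of your product."""
    return 1.0 if expected.lower() in model_answer.lower() else 0.0

def run_eval(prompt_template: str, test_cases: list[dict]) -> float:
    """Run every test case through the prompt template and return the mean score."""
    scores = []
    for case in test_cases:
        answer = call_llm(prompt_template.format(**case["inputs"]))
        scores.append(score_answer(answer, case["expected"]))
    return mean(scores)

# Usage: only adopt a candidate prompt if it beats the current one on the eval.
# baseline_score = run_eval(current_prompt, test_cases)
# candidate_score = run_eval(candidate_prompt, test_cases)
# if candidate_score > baseline_score:
#     current_prompt = candidate_prompt
```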