Monday, 16 June 2025

My talk at AI Engineer World's Fair 2025 in San Francisco

 (This was re-posted from a blog post I recently wrote on Orbital's Tech Blog)

--- 


Background

I was at the AI Engineer Summit in NYC in February 2025, listening to engineers trade war stories about building agentic systems. Sitting in the crowd, it hit me: my team and I had our own scars, lessons, and hacks from shipping at the frontier and they were worth sharing. By the time the next conference rolled around in San Francisco, my talk had been selected.

Title: “Buy Now, Maybe Pay Later: Dealing with Prompt-Tax While Staying at the Frontier”

The core idea? Frontier LLMs now drop at warp speed. Each upgrade hits teams with a Prompt‑Tax: busted prompts, cranky domain experts, and evals that show up fashionably late. In my talk, I share 18 months of bruises (and wins) from shipping an agentic product for real‑estate lawyers:

• The challenge of an evolving prompt library that breaks every time the model jumps

• The bare‑bones tactics that actually work for faster migrations

• Our “betting on the model” mantra: ship the newest frontier model even when it’s rough around the edges, then race to close the gaps before anyone else does

I wanted listeners to walk away with a playbook to stay frontier‑fresh without blowing up their roadmap or their team’s sanity.

Talk

Here is the full 25 minute video of my talk for the AI Engineer World’s Fair held in San Francisco on June 3-5, 2025:

Slide Deck

Here is the complete slide deck I presented for my talk with references below:

Wednesday, 27 November 2024

Microsoft Azure's case study of Orbital Witness

I've been working with Microsoft for the last couple of months on a case study they've written. It's about how Orbital Witness is revolutionising the legal tech sector and delivering efficiencies with Azure OpenAI.

In the case study we touch on the following concepts:

  • Pivoting early to GPT-4 and away from labelled data and more classical machine learning models
  • Building and productionising an AI Agent in late 2023 before AI Agents were a big thing
  • Embedding domain expertise into a product via prompt engineering
  • How LLM development requires a different mindset centred around iterating with human language
  • How customers are adapting to a generative AI product that behaves differently
  • A proprietary OCR system that enriches textual data

Here's that case study on Microsoft's website: https://customers.microsoft.com/en-gb/story/1827391161519296074-orbitalwitness-azure-professional-services-en-united-kingdom

Alternatively here's a PDF of the case study:

Wednesday, 21 August 2024

3 key insights from the release of "Simple Bench - Basic Reasoning" LLM benchmark

Background

Many of the best and brightest minds in AI have said that current benchmarks are not fit for purpose and are losing their utility. Popular benchmarks such as MMLU, HumanEval, DROP, GSM8K, HellaSwag and others have recently been "saturated" by the most intelligent models, and many people are calling for better benchmarks to help propel the industry further forward. A number of benchmarks show models crushing human performance even when it's obvious that model intelligence and reasoning capability isn't quite at human level across the board. Those with a vested interest in the current progress of AI would like a better understanding of how model intelligence and reasoning capability is progressing with each new model release. It would be incredibly helpful if we could more precisely gauge model progress relative to human-level performance.

In comes Philip of AI Explained to the potential rescue. Yesterday he released Simple Bench, a basic reasoning benchmark. He created it because he couldn't find a good enough reasoning benchmark whose questions, phrased in plain English, could be easily and correctly answered by normal people yet might still trip up current frontier models due to their limited reasoning capabilities.

Philip has previously gone into detail about the problems he's found within benchmarks such as MMLU.


Insight #1 - Anthropic has the edge


Notice that GPT-4 Turbo-Preview (which was recently replaced by the newer GPT-4o) is sitting in 2nd place whereas GPT-4o is only sitting in 7th place, a difference of 10% in its basic reasoning capability. In their May 13, 2024 blog post, OpenAI marketed GPT-4o by saying it "matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models." Based on Philip's Simple Bench scores this doesn't appear to be the case. He says he's almost certain that OpenAI made GPT-4o a lighter model with fewer parameters (making it cheaper and quicker to run). That better price-to-performance ratio has trade-offs, though, in that the model appears to have lost some reasoning capability.

The exception to this paradigm (where newer models are cheaper and faster but less capable at reasoning) is Anthropic's latest models. Claude 3 Opus is the biggest model from Anthropic and is sitting in 3rd place, whereas Claude 3.5 Sonnet (which is not the biggest model and is therefore faster and cheaper) is sitting in 1st place with slightly better reasoning performance. Philip says this is "unambiguous evidence that Anthropic has the secret sauce currently with LLMs" given how it is able to push speed, cost and intelligence all in the right directions without an apparent trade-off. The Simple Bench score that Claude 3.5 Opus receives when it is released will be very telling and may further cement Anthropic's innovation edge in the LLM space.

My key takeaway


If you're deciding to buy and roll out a chatbot within your organisation, or are beginning to invest in an LLM API to build products on, I'd pay particular attention to whether Anthropic does in fact continue to have the edge when Claude 3.5 Opus is released. I'd also keep tabs on OpenAI's next model release to see whether they reverse their current trend of sacrificing the model's reasoning capability for improvements in cost and speed.

Insight #2 - Models hallucinate but they also engage in sophistry


Current models can often pick up on important facts, and even sometimes inform us of the importance of those facts in answering the question, but they aren't always able to link those facts together properly in order to come up with the correct answer. It's as if models can recall the facts but not always reason about them effectively. Take this example that Philip of AI Explained hand-crafted and then gave to various models to answer, observing the results:

Some of the results:
  • Gemini 1.5 Pro is able to make the connection that "the bald man in the mirror is John" but then still gets the final answer wrong by saying that "John should apologise to the bald man", even though it's himself and one does not need to apologise to oneself
  • Claude 3.5 Sonnet says "The key realization here is that the 'bald man' John sees in the mirror is actually John himself. He's looking at his own reflection, and the lightbulb has hit him on the head." This is a good result; however, Claude then decides to eliminate an answer, saying "C) is incorrect because someone else (the bald man) did get hurt"
Philip says he "sees these illogicalities all the time when testing Simple Bench" and goes on to say that "models have snippets of good reasoning but often can't piece them together." What seems to be happening is that "the right weights of the model are activated to at least say something about the situation but not factor it in or analyse it or think about it." Philip also points to the paper Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, which highlights that "models favour fluency and coherency over factuality", and interprets this as "models don't quite weigh information properly and therefore don't always have the right level of rationality that we'd expect."

My key takeaway


This isn't hallucination per se but another form of "sophistry" where the model can unintentionally deceive. The model confidently recalls true and relevant facts but then sometimes follows up with blatant reasoning errors that a normal human probably wouldn't make. When building products in domains where accuracy and user trust are paramount, these reasoning errors will need to be mitigated.

Insight #3 - Slight variations in wording strongly affect performance



Philip explains that the above paper describes how "slight variations in the wording of the questions cause quite dramatic changes in performance across all models. Slight changes in wording trigger slightly different weights. It almost goes without saying that the more the models are truly reasoning, the less difference the wording of the question should make."

Given this sensitivity to specific wording, he goes on to describe the next potential paradigm with LLMs, where an LLM given a prompt can effectively rewrite it to be more optimal. The LLM can "actively search for a new prompt to use on itself, a sort of active inference in a way", which produces better results than the user's original prompt.
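
To make that idea concrete, here is a minimal sketch of what a self-rewriting step could look like, assuming the official OpenAI Python client; the model name, system instruction and function are illustrative and not something Philip prescribes:

    # Sketch: have the model rewrite the user's prompt before answering it.
    # Assumes the official OpenAI Python client; model name and instructions are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def answer_with_self_rewrite(user_prompt: str, model: str = "gpt-4o") -> str:
        # Step 1: ask the model to produce a clearer, more explicit version of the prompt.
        rewritten = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": (
                    "Rewrite the user's question so it is unambiguous, explicit about "
                    "constraints and easy to reason about. Return only the rewritten question."
                )},
                {"role": "user", "content": user_prompt},
            ],
        ).choices[0].message.content

        # Step 2: answer the rewritten prompt instead of the original wording.
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": rewritten}],
        ).choices[0].message.content
        return answer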

My key takeaway


  • Iterating on prompts by changing the wording matters, and this should be built into the product development and improvement process your team operates within. Evals really help here as they allow for more systematic iteration on prompts, since they enable repeated performance evaluation across the range of outcomes (not just a select few) that matter most to the users of your product. The workflow for your team could be (see the sketch after this list):
    • Write a prompt
    • Run the eval system and record its overall score
    • Iterate on that prompt
    • Re-run the eval system
    • If the overall score improves, adopt the prompt change; if not, continue to iterate on it or move on to other work if you're unable to squeeze out more performance
  • It's hard to know for sure whether AI labs will train future models to be more resilient to these wording variations. If Philip is right, this seems likely given that for a model to "truly reason" differences in wording should not produce wildly different levels of performance. In the meantime, how should a product be built? There are a plethora of prompt rewrite libraries to choose from, but these feel more like a plaster than an actual solution.
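
As a rough illustration of the workflow in the first bullet, here is a minimal sketch of that eval loop; the example set, prompts and run_prompt stub are illustrative placeholders rather than a real eval suite:

    # Sketch of the prompt-iteration eval loop; everything here is an illustrative placeholder.
    EXAMPLES = [
        {"input": "Is a 999-year lease a long lease?", "expected": "yes"},
        {"input": "Is a 6-month tenancy a long lease?", "expected": "no"},
    ]

    def run_prompt(prompt_template: str, question: str) -> str:
        # Placeholder: replace this with a call to your model of choice; here it just
        # echoes a fixed answer so the loop below runs end to end.
        return "yes"

    def eval_score(prompt_template: str) -> float:
        """Fraction of examples the prompt answers correctly (the 'overall score')."""
        correct = sum(
            run_prompt(prompt_template, ex["input"]).strip().lower() == ex["expected"]
            for ex in EXAMPLES
        )
        return correct / len(EXAMPLES)

    # Record the current score, try a reworded prompt, and only adopt it if the score improves.
    current_prompt = "Answer yes or no: {question}"
    candidate_prompt = "You are a property lawyer. Answer strictly 'yes' or 'no': {question}"

    if eval_score(candidate_prompt) > eval_score(current_prompt):
        current_prompt = candidate_prompt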

Update


As of Nov 27, 2024 Simple Bench has been updated with o1-preview, Grok 2 and newer Claude 3.5 Sonnet and Gemini 1.5 Pro models:



Thursday, 18 July 2024

The new AI reality: what's happening now and what's next?

 (This was re-posted from a blog post I recently wrote on Orbital Witness's Tech Blog)

---

Annual keynote to law firm real estate partners

Context

A year ago Orbital Witness held our annual event where I spoke about “Generative AI: Opportunities and risks for property transactions”. Earlier this month, we continued this tradition by hosting ‘The AI Edge: Real-world Lessons for Real Estate Lawyers’. Held at Google HQ, the event brought together real estate partners from some of the leading law firms in the UK to share the latest developments in Generative AI and how it is currently revolutionising real estate legal.

I gave the keynote presentation which started by setting the scene for where we are on the innovation S-Curve. I then delved into a range of important aspects of Generative AI for real estate lawyers such as model intelligence, context window size, model cost & speed, proprietary vs open-weight models, AI Agents, multi-modality and use cases in real estate legal. During the discussion, I was also able to contextualise for the audience the trajectory of Orbital Copilot, our own AI legal assistant, as we continue to innovate the product and as Generative AI advances.

This visual sums up the incredible achievement of what’s now possible at the bleeding edge of Generative AI when combining large language models (LLMs) within an AI Agent framework and focusing on a specific practice area, real estate legal, in order to provide turnkey solutions to customers:

Presentation

Here is the full 40 minute video of the keynote:

Slide Deck

Here is the complete slide deck I presented for my keynote with references below:

Slide 6: https://www.tooltester.com/en/blog/chatgpt-statistics

Slide 11: https://www.ben-evans.com/presentations

Slide 19: https://public.flourish.studio/visualisation/18163738

Slide 24: https://www.linkedin.com/feed/update/urn:li:activity:7183400918934016001

Slide 25: https://twitter.com/AIExplainedYT/status/1793561610730320338

Slide 27: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233

Slide 28: https://www.youtube.com/watch?v=PeSNEXKxarU

Slide 29: https://mmmu-benchmark.github.io

Slide 32: https://x.com/LibertyRPF/status/1658497036080017408

Slide 42: https://www.linkedin.com/feed/update/urn:li:activity:7183501457684365314

Slide 42: https://www.youtube.com/watch?v=DQacCB9tDaw

Slide 44: https://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance

Slide 47: https://ai.meta.com/blog/meta-llama-3

Slide 48: https://x.com/maximelabonne/status/1790519226677026831

Slide 52: https://tech.orbitalwitness.com/posts/2023-06-27-genai-opportunities-and-risks-for-property-transactions

Slide 53: https://tech.orbitalwitness.com/posts/2024-01-10-we-built-an-ai-agent-that-thinks-like-a-real-estate-lawyer

Slide 55: https://lilianweng.github.io/posts/2023-06-23-agent

Slide 56: https://sierra.ai

Slide 58: https://www.paulweiss.com/resources/podcasts/waking-up-with-ai/waking-up-with-ai-list/2024/april/ep-6-autonomous-ai-agents-are-a-hot-topic-for-2024

Slide 69: https://www.ben-evans.com/benedictevans/2024/4/19/looking-for-ai-use-cases

Saturday, 1 June 2024

Teaching my eleven year old’s class about AI

(This was re-posted from a blog post I recently wrote on Orbital Witness's Tech Blog)

---


What happens when you’re asked to teach 30 eleven year olds about AI? I did just that at my son’s school last month. Here’s what I learnt.

Background

My son came home a few months ago and said “Daddy, you need to come to my class at school very, very soon and teach us everything about AI.”

That’s clearly no small feat, trying to unpack the complexity of artificial intelligence (AI) for an audience of eleven year olds, but I was up for the challenge. I wanted to strike the right balance between being informative about what powers AI and how it works, while also showing them the creative ways AI is being applied in the world right now (because that’s the really fun stuff). I set out to create a visual presentation that would achieve that, along with providing ample time for the myriad questions that would inevitably pop up throughout the lesson.

Presentation

Here’s the 37-slide deck that I used for around 30 children. I was allotted an hour but, due to all the questions from both students and teachers, we ended up spending an hour and a half in total getting into some of the nitty gritty of AI:

Some observations

From the ethics around AI, to what it means for the future of work and creativity, here are some of the themes that came out of our time together:

  1. The kids I taught had a real thirst for a deeper understanding of AI. They really wanted to know how it worked, what it could help them accomplish, what the risks and ethical considerations were with using it. They even wanted to know why exactly AI doesn’t always say naughty things they sometimes ask it to say.
  2. I was pleasantly surprised by how philosophical some kids were, evidenced by questions such as “given that you’ve been coding for years and you’re now coding AI, did you ever think that maybe someone is actually coding you…?”
  3. There was a bit of an undercurrent of worry about what they should do career-wise if AI is going to end up being able to do anything. If they’re interested in music making or art or coding, and AI ends up being good at those, is it worth still pursuing those interests…?

Overall it was a fascinating lesson not only for them to learn about AI but also for me to see how they perceive the latest developments in this technology. This tidal wave of innovation is underway and it’s already impacting a future generation of creators and builders who will enter the workforce in a decade from now. And – if you end up using this deck to educate your own children, let me know. I’d love to hear what you learn from the experience.