
OpenAI’s Strawberry (or o1) is great at math and reasoning. But how well can it write?

Last Updated:
December 9, 2024

Late last week, OpenAI released the first of its new series of AI models, code-named Strawberry, real name o1.

The new models are “designed to spend more time thinking before they respond”. This means that rather than coming up with an answer almost instantly like most AI models do, they go through an internal thought process before producing a response.

Why Strawberry?

When asked how many ‘r’s there are in the word strawberry, ChatGPT confidently said: two.

Many AI models make this counting error because of tokenization, where LLMs break down text into smaller units for processing. They don’t go by individual letters; they go by tokens. When the word strawberry goes through GPT-4’s tokenizer, it becomes ‘str’, ‘aw’ and ‘berry’.

Source: OpenAI

Tokenization helps AI models process text more efficiently, but it results in limitations like this one.
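To make the tokenization point concrete, here is a minimal Python sketch. It hard-codes the three pieces quoted above instead of calling a real tokenizer, and the helper name `count_letter` is made up for illustration. It only contrasts the character-level view a program gets with the chunks a model receives; it does not show how the model itself counts.

```python
# Token pieces as quoted in the article for GPT-4's tokenizer.
# Hard-coded for illustration; no real tokenizer is called here.
tokens = ["str", "aw", "berry"]

# Joining the pieces gives back the original word.
word = "".join(tokens)


def count_letter(text: str, letter: str) -> int:
    """Count how many times a single letter appears in text (hypothetical helper)."""
    return sum(1 for ch in text if ch == letter)


# Character-level view: easy when every letter is visible.
total_r = count_letter(word, "r")

# Token-level view: the 'r's are split across pieces that the model
# handles as whole units rather than as individual letters.
r_per_token = {tok: count_letter(tok, "r") for tok in tokens}

print(word, total_r)   # strawberry 3
print(r_per_token)     # {'str': 1, 'aw': 0, 'berry': 2}
```

The three ‘r’s sit in two different tokens, so a model that works with whole tokens never looks at them letter by letter.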

The new series of AI models that OpenAI released can now tell us there are 3 ‘r’s in the word strawberry. More importantly, they can solve far more complex problems than previous models, especially in science, coding and math.

Noam Brown says this is not the reason [1] why they call it Strawberry. But that’s hard to believe, so we’ll stick with the original story.

Also, sources [2] say that Strawberry was formerly known as Q*, which was said to be the mysterious AI model that got OpenAI’s CEO Sam Altman fired [3].

Alright, enough of the name-guessing and drama.

But one last quick fact: why is it named o1?

According to OpenAI, “for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.”

What is o1?

o1 is OpenAI’s newest model, trained with reinforcement learning to solve more complex tasks and tougher problems. It does this by spending more time thinking before responding.

When given a problem to solve, the model goes through Chain-of-Thought (CoT) reasoning. This includes breaking down a problem into smaller ones and taking a step-by-step approach to solving them.

It learns to recognize and correct its mistakes, and it tries a different approach when it hits a roadblock or realizes that its current method won’t lead to a solution.

The model also works within constraints.

Peering into its chain of thought, it’s interesting to see that it takes the amount of time it has into consideration, much like a human would.

Source: OpenAI

In a series of tests done by OpenAI,

o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

With more reinforcement learning, its performance and accuracy also improve consistently.

Source: OpenAI

Comparing o1 and 4o

Source: OpenAI

When comparing the two models across various human exams and ML benchmarks, o1 exceeds 4o on many fronts, especially in areas that rely heavily on reasoning.

But while it’s clear that o1 performs better in complex reasoning tasks, 4o still thrives in many common use cases.

For one, 4o is much faster at responding.

When given the same grade 6 math question, o1 took 26 seconds to come up with the answer while 4o worked through it immediately.

Both got the right answer.

Response by o1

Secondly, the current o1 model has no access to the internet, which makes it hard for its responses to be factually accurate or up to date. It’s also not multimodal, which means it cannot parse images or audio at this point.

This makes it less useful for certain tasks like debugging or writing code, where it’s easier to just show the AI model what you’re seeing on the screen.

o1 in writing content

Given its strong capabilities in reasoning-heavy tasks, we were curious how o1 would fare at language and writing.

So we decided to put it to the test.

Writing blog articles

As o1 doesn’t have access to the internet, we weren’t able to fairly judge it on a blog article written to rank in search, since that would require competitive analysis in real time.

Instead, we reviewed it purely on its ability to write a long-form article, focusing on fluency, logic and flow.

We also compared it with 4o, using a single prompt (i.e. no further prompting or guidance past the first generation attempt).

Here’s our prompt and the article is here.

Can you write a blog article on personal finance and budgeting tips for an early retirement?

My target audience is working professionals who have worked for at least 5 years. The tone should be enthusiastic and casual.

Provide practical tips for managing personal finances, covering budgeting, saving, debt management, and investment strategies.

The written article is fluent and concise. There’s variation in sentence structure and length, making it easy to read and digest.

But it’s interesting to note that the entire article is written in bullets, with paragraphs kept to a minimum. This makes it feel more like a checklist than an enjoyable read, which would be alright if there were further elaboration within each bullet (there isn’t).

The content makes sense in general and the advice given is accurate and logical, but it lacks detail and elaboration, which makes it less useful.

The enthusiastic tone shines through, with the model using snappy phrases and wordplay like “Small savings add up big time” and “Let your money work hard so you don’t have to”.

Overall, the article is delightful to read. It’s logical and fluent, but requires a lot more meat to qualify as a good piece.

Of course, it might do better with additional prompts.

To write an article that’s fluent, meaty and optimized for search engines like Google, try our SEO AI article writer.

Writing product descriptions for ecommerce

For product descriptions, we prompted o1 with this:

Write a one-paragraph product description for black-out curtains in modern style, light grey

It spun up this description:

Enhance your living space with our Modern Light Grey Blackout Curtains, where sleek design meets exceptional functionality. Crafted from premium, high-density polyester, these curtains effectively block out sunlight and harmful UV rays, creating a restful ambiance for sleep or relaxation at any time of day. The sophisticated light grey hue complements contemporary interiors, adding a touch of elegance to bedrooms, living rooms, or offices. Easy to install with durable stainless steel grommets, they also offer energy-saving insulation by maintaining indoor temperatures and reducing noise. Transform your home with curtains that deliver both style and substance.

The result was good, highlighting not just features but also the benefits that follow. However, the model made up some product details that were not given, like high-density polyester and stainless steel grommets.

With the same prompt, the 4o model didn’t include the extra made-up details.

Elevate your home with our Modern Light Grey Blackout Curtains, designed to blend seamlessly with contemporary decor. These curtains provide complete light blockage, ensuring a peaceful environment for restful sleep or relaxation. Their soft, minimalist shade of grey adds a touch of sophistication to any room, while also offering practical benefits like improved insulation and noise reduction. Perfect for bedrooms, living rooms, or offices, these blackout curtains deliver the ideal combination of style, comfort, and functionality.

This suggests that the o1 model, given its stronger reasoning ability and thought process, made some assumptions about what a good blackout curtain would be like. This is neither clearly good nor clearly bad, as we could easily remove the additional details with a single prompt.

While the product descriptions sound fluent, reasonable and compelling, they lacked character and uniqueness, sounding plain vanilla, as if we’d read them somewhere else.

If you’re looking to write product descriptions that are unique and engaging, try Hypotenuse AI’s product description generator. Not only does it write fluently, highlighting benefits throughout, it also uses words that trigger emotions and visual imagination—making it more compelling and persuasive.

You could also use our bulk generator to craft descriptions at scale, making it easier and faster to launch new collections.

Potential risks

While o1 is rated low risk in Model Autonomy and Cybersecurity, it’s rated medium risk in Chemical, Biological, Radiological and Nuclear (CBRN) and in Persuasion.

Why medium?

Medium risk in CBRN

Internal evaluations by OpenAI found that the models can “help experts with the operational planning of reproducing a known biological threat”. But because these experts already have significant domain expertise, the added risk is limited.

“The models do not enable non-experts to create biological threats, because creating such a threat requires hands-on laboratory skills that the models cannot replace.”

Medium risk in Persuasion

The internal evaluations also revealed that while o1 can produce written arguments that are as persuasive as arguments written by humans on the same topics, it is not able to rival top human writers.

But external evaluators like Apollo Research, in tests looking specifically for evidence of scheming, found that o1 could possibly deceive humans in a bid to achieve its goals within the constraints stated in the prompt.

This is an alignment problem, where AI models don’t possess human values like fairness or morality.

That’s why human oversight and logical thinking are important when using AI.

As an AI content platform that offers advanced AI content capabilities for ecommerce brands, we build workflows that enable users to have a high degree of control at every step of the way.

This makes sure that brands put out information that’s accurate and backed by sources, and that the content meets what ecommerce brands would expect, whether in brand tone of voice, factual accuracy or custom formatting.

Conclusion

While o1 (or Strawberry) excels in math, coding and science, there’s huge room for improvement in generating copy and content.

As OpenAI’s CEO Sam Altman shares, it’s “still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”

It’ll be interesting to see where the new series of models can take us, given that it’s still in its early stages. For now, math and coding exams are still on.

[1] https://x.com/polynoamial/status/1834312400419652079

[2] https://www.reuters.com/technology/artificial-intelligence/openai-working-new-reasoning-technology-under-code-name-strawberry-2024-07-12/

[3] https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/

Sushi
Growth
Sushi has years of experience driving growth across ecommerce, tech and education. She gets excited about growth strategy and diving deep into channels like content, SEO and paid marketing. Most importantly, she enjoys good food and an excellent cup of coffee.
