The Rising Cost of Evaluating Reasoning AI Models: The Tech Industry's New Challenge

Reasoning AI models are being released at a rapid pace, but experts say they are far more expensive to benchmark than earlier models, making independent verification of their capabilities difficult.

A number of AI companies, including OpenAI and DeepSeek, have launched models that can reason through problems step by step, and these models tend to outperform their non-reasoning predecessors on many tasks.

However, according to data from Artificial Analysis, an independent AI testing and evaluation organization, scoring OpenAI's o1 reasoning model on seven popular benchmarks (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500) cost $2,767. Benchmarking Claude 3.7 Sonnet, Anthropic's hybrid reasoning model, cost $1,485.


Artificial Analysis said it spent a total of $5,200 to evaluate fewer than 10 reasoning models, more than double the $2,400 it spent evaluating over 80 non-reasoning models. For comparison, evaluating the non-reasoning GPT-4o, released in May 2024, cost just $108.85, and the non-reasoning Claude 3.6 Sonnet cost $81.41, while even a comparatively cheap reasoning model, o3-mini, cost $344.

Artificial Analysis isn’t the only one facing rising AI evaluation costs. Ross Taylor, CEO of AI startup General Reasoning, said he spent $580 evaluating Claude 3.7 Sonnet on about 3,700 prompts. He estimated that a single run of MMLU-Pro, a benchmark designed to assess a model’s language understanding skills, would cost more than $1,800.

Testing is expensive because reasoning models generate so many tokens. Tokens are chunks of raw text; for example, the word “fantastic” might be split into “fan”, “tas”, and “tic”. Since model providers charge per token, the more tokens a model produces, the more a benchmark run costs. Artificial Analysis says OpenAI’s o1 generated more than 44 million tokens during the company’s testing, around eight times as many as GPT-4o.
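To make the token-to-cost relationship concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer. The per-million-token price is a hypothetical placeholder (not a quoted rate), and the exact token split of a word depends on the tokenizer, so it may differ from the “fan”/“tas”/“tic” example above.

```python
# Minimal sketch: how text becomes tokens, and how token counts drive cost.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models

token_ids = enc.encode("fantastic")
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)  # the exact split depends on the tokenizer's vocabulary

# Rough cost estimate: output tokens generated during a benchmark run,
# multiplied by an assumed (hypothetical) price per million output tokens.
output_tokens = 44_000_000   # tokens o1 reportedly generated during testing
price_per_million = 60.0     # placeholder $/1M output tokens, not a quoted rate
estimated_cost = output_tokens / 1_000_000 * price_per_million
print(f"${estimated_cost:,.2f}")  # -> $2,640.00 under these assumptions
```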

Modern benchmarks also drive up token counts because they involve complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI.

Some AI companies, including OpenAI, offer benchmarking organizations free or discounted access to their models, but researchers worry that such subsidies could influence the results and compromise the integrity of the scores.
