The world of Large Language Models (LLMs) is getting ever more crowded while the differences in the LLMs’ language understanding are growing ever smaller. This commoditization of the base capabilities inside LLMs is causing the biggest LLM makers (those with the highest valuations) to look beyond basic language understanding for differentiation. OpenAI is rumored to be aggressively working to build agents into its flagship product, ChatGPT.
Beyond MMLU: Massive Multitask Language Understanding (MMLU) is a comprehensive benchmark designed to evaluate the language understanding capabilities of AI models across a wide array of subjects. It is one of many benchmarks used to rate one LLM against another. It seems that every week another LLM (or another major version of an existing LLM) is released and proclaimed to be the best based on the benchmarks. For example, five days ago Anthropic released Claude 3, claiming that it had the best performance across many benchmarks, including MMLU. Unfortunately for OpenAI, Claude 3 now holds the number one position on the MMLU benchmark with a score of 86.8% versus GPT-4’s score of 86.4%. Given that OpenAI has a reported valuation of over $80B, it must stay far ahead of the competition to justify that figure. Even a small lead over the competition in language understanding will not keep OpenAI’s valuation aloft.
Uneasy lies the head that wears a crown. OpenAI is rumored to be hard at work on adding agents to its platform. The company behind ChatGPT is reportedly developing software designed to operate devices and automate tasks, potentially altering the digital workspace. Poised to extend beyond the cloud, this new agent software aims to streamline tasks such as transferring data between applications and managing expense reports by taking control of a user’s device, with the user’s permission, and interacting with it much as a human would.
Rumblings in Redmond? The envisioned software would enable OpenAI to transform ChatGPT into a “supersmart personal assistant,” enhancing productivity and potentially challenging Microsoft’s enterprise app automation. The agent is built on LLMs and seeks to overcome the limitations of current productivity bots by integrating capabilities to understand and execute code, manage images, and retrieve files. With this, OpenAI aims to build an AI agent that not only engages in conversations but also acts as a quasi-operating system. Given its encroachment on Microsoft’s historical turf, one wonders how long the OpenAI-Microsoft flirtation will endure. It may soon be time for the two companies to either get married or go their separate ways.
Not so secret agents. The two agents reportedly under development, one that operates within the user’s device and another that handles web-based tasks, are critical to OpenAI’s ambition. Backed by the company’s $80+ billion valuation, they are meant to keep its technology at the forefront despite competition from Google’s Gemini LLM.
And away we go. OpenAI’s move into autonomous agent software aligns with broader industry trends and promises to redefine human-computer interaction. While still navigating the complex terrain of public acceptance and technical feasibility, OpenAI’s shift indicates a future where AI does more than answer questions—it actively participates in fulfilling tasks, potentially reshaping the technological landscape and the very fabric of daily digital routines.