How can a bank pick the AI model that suits it best?

Financial institutions are generally familiar with ChatGPT, but how does it compare to Anthropic's model Claude and Cohere's language models?

As financial institutions experiment with large language models, OpenAI and its ChatGPT product have become the most recognizable names in the artificial intelligence space. Yet alternatives exist.

So the biggest question many banks, research firms, academics and others are racing to answer is this: How does a company know which language model is right for it?

One of OpenAI's main competitors is Anthropic, which recently landed a $1.25 billion investment from Amazon for a minority stake in the company. Another is Cohere, which was valued at more than $2.2 billion as of June, according to Reuters. Each offers proprietary language models that companies can fine-tune to their own purposes through an application programming interface.

In other words, to take advantage of OpenAI's GPT-4 (one of the models that powers ChatGPT), Cohere's Command or Anthropic's Claude, banks must typically trust these companies with their data — and some have. For example, Morgan Stanley has provided GPT-4 with access to its content library of "hundreds of thousands of pages" of information to make it easier for employees to tap the company's collective knowledge, according to OpenAI.
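
In practice, tapping a hosted model means sending prompts, along with any data that accompanies them, to the provider's API. The snippet below is a minimal sketch using OpenAI's Python SDK; the model name, prompt and system message are illustrative assumptions rather than a recommended configuration.

# Minimal sketch: querying a hosted model through OpenAI's Python SDK.
# Assumes the OPENAI_API_KEY environment variable is set; the prompt and
# model name here are illustrative only.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using the bank's internal policy documents."},
        {"role": "user", "content": "Summarize the wire-transfer approval policy in three bullet points."},
    ],
)

print(response.choices[0].message.content)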

Similarly, Anthropic and Cohere each allow companies to tweak their language models through their own APIs or through Amazon Web Services. Specifically, Claude and Command are available through Amazon Bedrock, a service that allows companies to build AI applications on Amazon's cloud infrastructure.

Bedrock also offers access to language models built by other companies, including Meta, AI21 and Amazon itself. These models are often called foundation models, a term for general-purpose models that companies can build on as the foundation for their own customized applications.
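
As a rough illustration of what building on Bedrock looks like, the sketch below calls Claude through Amazon's boto3 library for Python. The model identifier, region and request body are assumptions; each foundation model on Bedrock expects its own request schema, and the details can change over time.

# Rough sketch: invoking a foundation model through Amazon Bedrock with boto3.
# The model ID, region and request body are assumptions and vary by model.
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Explain what a foundation model is in one sentence.\n\nAssistant:",
    "max_tokens_to_sample": 200,
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",  # assumed ID; Cohere's Command is exposed under its own ID
    body=body,
    contentType="application/json",
    accept="application/json",
)

print(json.loads(response["body"].read())["completion"])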


One downside to modifying proprietary models for a specific application is the need to relinquish some control of the data. The companies have developed some protections; for example, Amazon promises data in its Bedrock product "is always encrypted in transit and at rest, and you can encrypt the data using your own keys."

Still, financial institutions have options for building with language models without sharing their data with an AI company or racking up cloud-computing bills. Many companies offer free and open-source models that developers can download and use on their own computers.

A leader in open-source language models is Cerebras, which in March released a family of language models that range in size and capability. In April, software company Databricks released a large language model named Dolly that it trained to "exhibit ChatGPT-like human interactivity." Meta also offers a language model called Llama 2, which is available free of charge under a license that permits commercial use.
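
As a simple sketch of what running such a model locally can look like, the snippet below loads an open-source checkpoint with the Hugging Face transformers library; the model ID, prompt and settings are illustrative assumptions, and larger models demand correspondingly more memory and hardware.

# Simple sketch: running an open-source model locally with Hugging Face transformers.
# The checkpoint and prompt are illustrative assumptions; no data leaves the machine.
import torch
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="databricks/dolly-v2-3b",  # assumed: a smaller checkpoint in the Dolly family
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package; falls back to CPU without a GPU
)

result = generate("List three common red flags for wire fraud:", max_new_tokens=100)
print(result[0]["generated_text"])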

One common way to compare models is by their number of parameters, which are roughly analogous to the connections between neurons in a human brain. Generally, the more parameters a model has, the more capable it can be, but not always. For example, Google has said the second version of its large language model, PaLM, is smaller than the original yet supports more capabilities. PaLM 2 reportedly has 340 billion parameters, while the initial version had 540 billion.

OpenAI released GPT-3 in June 2020 with 175 billion parameters. GPT-2, which came out in 2019, had 1.5 billion parameters, and its predecessor GPT-1 had roughly 120 million. The company has not disclosed how many parameters GPT-4 has, though the historical trend suggests it exceeds a trillion.

Elsewhere, Abu Dhabi's Technology Innovation Institute this month released Falcon 180B through the French-American AI company Hugging Face, which described it at the time as the "largest openly available language model." As its name suggests, the model has 180 billion parameters.

Most other free models are not nearly as large as Falcon 180B. The largest version of Meta's free model Llama 2 has 70 billion parameters. The largest model Cerebras released this year has 13 billion parameters. Databricks' Dolly has 12 billion, and MosaicML's MPT-30B has 30 billion.

Parameter counts provide only a rough sense of how one model might perform compared with another. Models can have advantages and disadvantages depending on the task at hand and the context in which the task is performed. This has inspired Stanford University's Center for Research on Foundation Models to benchmark the most prominent language models in what it calls the Holistic Evaluation of Language Models (HELM).

As of November, HELM had run 4,900 evaluations on 30 language models and has since run thousands more on 48 additional models, including GPT-4 and models from Cohere and Anthropic. These evaluations test 59 aspects of model performance, including various metrics of accuracy and truthfulness, efficiency, fairness and bias. HELM researchers test these metrics in 42 scenarios, including text classification tasks, news article summarization, question answering and comment toxicity detection.

HELM plans to evaluate each model in additional scenarios, such as fact verification and copywriting, and to add models such as Databricks' Dolly 2 and Google's PaLM to its evaluations.

While these tests developed and run by academics provide a basis on which to compare models, many companies seek to have the evaluations tailored to their particular purposes. To provide that service, Arthur AI, a New York-based AI performance company, offers an open-source product called Bench to help evaluate LLMs "for production use cases," according to the company.

Ultimately, evaluating a language model's performance in any application is an ongoing effort that is bolstered by greater transparency from the creators of language models, according to the researchers behind Stanford's HELM, who advocate for "holistic, pluralistic and democratic benchmarks" for language models.

"Transparency begets trust and standards," said three HELM authors in an update on the project. "By taking a step towards transparency, we aim to transform foundation models from an immature emerging technology to a reliable infrastructure that embodies human values."
