
LLM BENCHMARKS

Large Language Model Benchmarking for Extractive, Classification, and Predictive Tasks

Welcome to the ultimate resource for comparing large language models. Here, we meticulously analyze and present the accuracy, speed, and cost-effectiveness of leading models across critical tasks such as information extraction, clause classification, and summarization.

Indico Data has been a guiding force in the AI industry since its inception, consistently emphasizing practical AI applications and real customer outcomes amidst a landscape often clouded by overhype. Indico was the first in the industry to deploy a large language model-based application inside the enterprise and the first to integrate explainability and auditability directly into its products, setting a standard for transparency and trust.

While the vast majority of LLM benchmarking focuses on chatbot-related tasks, Indico recognized the need to understand how large language models perform on more deterministic tasks such as extraction and classification, and how their performance and cost vary with context length and task complexity.

Leaderboard

Indico Data runs a monthly benchmarking exercise across providers (Llama, Azure OpenAI, Google, AWS Bedrock, and the Indico-trained discriminative language models RoBERTa and DeBERTa), datasets (e.g., CORD and CUAD), and capabilities (text classification, key information extraction, and generative summarization). The table below ranks the accuracy (F1 score) of these models for each capability, averaged over datasets and prompt styles. The "Accuracy" page contains the same information at a much more granular level of detail.
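For illustration, here is a minimal sketch of how a leaderboard like this could be assembled from per-run scores. The DataFrame columns, model names, and example numbers are our assumptions for the sketch, not Indico's actual pipeline or results.

```python
import pandas as pd

# Hypothetical per-run benchmark results; column names and values are made up.
runs = pd.DataFrame({
    "model":        ["model-a", "model-a", "model-b", "model-b"],
    "capability":   ["extraction"] * 4,
    "dataset":      ["CORD", "Kleister NDA", "CORD", "Kleister NDA"],
    "prompt_style": ["default"] * 4,
    "f1":           [0.81, 0.74, 0.77, 0.69],
})

# Average F1 over datasets and prompt styles, then rank within each capability.
leaderboard = (
    runs.groupby(["capability", "model"])["f1"]
        .mean()
        .reset_index()
        .sort_values(["capability", "f1"], ascending=[True, False])
)
print(leaderboard)
```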

Accuracy Comparisons

Full details of this month's benchmarking run across models, capabilities, and prompt styles. This information is meant to help you choose the best model for a given task. For example, if missed information in your process is expensive, you should choose a model with high recall.
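For reference, precision, recall, and F1 relate as in the sketch below; the function and the example counts are illustrative, not taken from the benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and their harmonic mean (F1).

    High recall (few false negatives) matters when missed information is
    expensive; high precision matters when false positives are expensive.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 correct extractions, 10 spurious, 20 missed.
print(precision_recall_f1(tp=80, fp=10, fn=20))  # (0.889, 0.8, 0.842)
```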

Green means better than average, red means worse than average, and orange means average.
The size indicates how far above or below average the model scores.

Cost of Ownership

Gain insights into not just how well each model performs, but also how fast and cost-efficiently it does so.

Plotted below are the tradeoffs between accuracy (F1 score) and cost, and between accuracy and response time, for each model across all capabilities, datasets, and prompt styles.

Extraction Datasets

CORD

COnsolidated Receipt Dataset for post-OCR parsing.

Original source: https://github.com/clovaai/cord

From the authors:
…The dataset consists of thousands of Indonesian receipts, which contains images and box/text annotations for OCR, and multi-level semantic labels for parsing...

Kleister NDA

Non-disclosure agreements, published by Applica AI.

Original source: https://github.com/applicaai/kleister-nda

From the authors:
Extract the information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, etc...

Charity Reports

Charity financial reports, published by Applica AI.

Original source: https://github.com/applicaai/kleister-charity

From the authors:
The goal of this task is to retrieve charity address (but not other addresses), charity number, charity name and its annual income and spending in GBP in PDF files published by British charities...

Classification Datasets

CUAD

Contract Understanding Atticus Dataset

Original source: https://www.atticusprojectai.org/cuad

From the authors:
...a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contract review...

Resource Contracts

ResourceContracts is a repository of publicly available oil, gas, and mining contracts.

Original source: https://www.resourcecontracts.org/

Indico retrieved hundreds of contracts from this repository and labeled key information including names, organizations, section orders, and full clauses (used in this classification task).

Contract NLI

Legal language classification into three classes: Entailment, Contradiction, or NotMentioned.

Original source: https://stanfordnlp.github.io/contract-nli/

From the authors:
ContractNLI is the first dataset to utilize NLI for contracts and is also the largest corpus of annotated contracts (as of September 2021)...

Summarization Datasets

SciTLDR

Extreme Summarization of Scientific Documents

Original source: https://github.com/allenai/scitldr

From the authors:
We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language.

Prompt Styles: Extraction

Documents are first split into overlapping chunks of roughly 1200 tokens and then those chunks are injected into the following prompt structure:

  • System prompt: You are a skilled human knowledge worker whose task is to extract key information from text.
  • Extraction instructions:
    • Find the data elements in the document that match the instructions for the fields provided.
    • Do not calculate or infer anything. Answers should be copied directly from the Document with no modification or formatting changes.
    • Answer "N/A" if no perfect matches are available. (Note: the majority of responses will be N/A.)
    • Output your answer(s) as a bulleted list. DO NOT number your answers.
  • Finally, the LLM is fed a document chunk c and descriptions of the fields to be extracted.
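A minimal sketch of this chunk-and-prompt flow in Python. The ~1200-token chunk size is from the text above; the tokenizer choice, the 200-token overlap (the page says only "overlapping"), and the field-description formatting and function names are our assumptions.

```python
import tiktoken

SYSTEM_PROMPT = ("You are a skilled human knowledge worker whose task is to "
                 "extract key information from text.")

INSTRUCTIONS = """Instructions:
- Find the data elements in the document that match the instructions for the fields provided.
- Do not calculate or infer anything. Answers should be copied directly from the Document with no modification or formatting changes.
- Answer "N/A" if no perfect matches are available. (Note: the majority of responses will be N/A.)
- Output your answer(s) as a bulleted list. DO NOT number your answers."""

def chunk_text(text: str, chunk_tokens: int = 1200, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks of roughly `chunk_tokens` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [enc.decode(tokens[i:i + chunk_tokens]) for i in range(0, len(tokens), step)]

def build_extraction_messages(chunk: str, fields: dict[str, str]) -> list[dict]:
    """Assemble system + user messages for one chunk and its field descriptions."""
    field_block = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    user = f"{INSTRUCTIONS}\n\nDocument:\n{chunk}\n\nFields:\n{field_block}"
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user}]
```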

Prompt Styles: Classification

There are six distinct classification prompt styles applied to each chunk c and class list, but they all share a common backbone:

  • Classification instructions:
    • You are a skilled human knowledge worker whose task is to classify text.
    • Please classify this text:\n------------------\n{c}\n--------------\ninto one of these categories: {class_list}
    • Respond with one category only.
  • The variants:
    • No description of the classes
    • With descriptions of the classes
    • Prompted to include rationale (with and without descriptions): "Show your workings and then answer using the format 'Answer: …'"
    • Prompted under duress with rationale (with and without descriptions): "I am under serious pressure to get this right and may lose my job if I don't. Please help me."
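A minimal sketch of how these six variants could be generated from the shared backbone. The helper name, the placement of the suffixes, and the class-description formatting are our assumptions.

```python
# Backbone copied from the prompt structure above; {c} and {class_list} are
# the chunk and class-list placeholders named in the text.
BACKBONE = (
    "You are a skilled human knowledge worker whose task is to classify text.\n"
    "Please classify this text:\n------------------\n{c}\n--------------\n"
    "into one of these categories: {class_list}\n"
    "Respond with one category only."
)

RATIONALE = "Show your workings and then answer using the format 'Answer: ...'"
DURESS = ("I am under serious pressure to get this right and may lose my job "
          "if I don't. Please help me.")

def build_prompts(chunk: str, classes: list[str],
                  descriptions: dict[str, str] | None = None):
    """Yield (variant_name, prompt) pairs for one chunk and class list."""
    if descriptions:  # "with descriptions of the classes"
        class_list = "; ".join(f"{c}: {d}" for c, d in descriptions.items())
    else:             # "no description of the classes"
        class_list = ", ".join(classes)
    base = BACKBONE.format(c=chunk, class_list=class_list)
    yield "plain", base
    yield "rationale", f"{base}\n{RATIONALE}"
    yield "duress+rationale", f"{base}\n{DURESS}\n{RATIONALE}"
```

Calling build_prompts twice per chunk, once with and once without class descriptions, produces the six variants.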
