
LLM BENCHMARKS

Large Language Model Benchmarking for Extractive, Classification, and Predictive Tasks

Welcome to the ultimate resource for comparing large language models. Here, we meticulously analyze and present the accuracy, speed, and cost-effectiveness of leading models across critical tasks such as information extraction, clause classification, and summarization.

Indico Data has been a guiding force in the AI industry since its inception, consistently emphasizing practical AI applications and real customer outcomes amidst a landscape often clouded by overhype. Indico was the first in the industry to deploy a large language model-based application inside the enterprise and the first to integrate explainability and auditability directly into its products, setting a standard for transparency and trust.

While the vast majority of LLM benchmarking focuses on chatbot-related tasks, Indico recognized the need to understand how large language models perform on more deterministic tasks such as extraction and classification, and how their performance and cost vary with context length and task complexity.

Leaderboard

Indico Data runs a monthly benchmarking exercise across providers (Llama, Azure OpenAI, Google, AWS Bedrock, and the Indico-trained discriminative language models RoBERTa and DeBERTa), datasets (e.g., CORD and CUAD), and capabilities (text classification, key information extraction, and generative summarization). The table below ranks the accuracy (F1 score) of these models for each capability, averaged over datasets and prompt styles. The "Accuracy" page contains the same information at a much more granular level of detail.
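For illustration, here is a minimal sketch of how a leaderboard like this could be assembled from per-run scores. The DataFrame columns, model names, and example numbers are our assumptions for the sketch, not Indico's actual pipeline or results.

```python
import pandas as pd

# Hypothetical per-run benchmark results; column names and values are made up.
runs = pd.DataFrame({
    "model":        ["model-a", "model-a", "model-b", "model-b"],
    "capability":   ["extraction"] * 4,
    "dataset":      ["CORD", "Kleister NDA", "CORD", "Kleister NDA"],
    "prompt_style": ["default"] * 4,
    "f1":           [0.81, 0.74, 0.77, 0.69],
})

# Average F1 over datasets and prompt styles, then rank within each capability.
leaderboard = (
    runs.groupby(["capability", "model"])["f1"]
        .mean()
        .reset_index()
        .sort_values(["capability", "f1"], ascending=[True, False])
)
print(leaderboard)
```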

Accuracy Comparisons

Full details of this month's benchmarking run across models, capabilities, and prompt styles. This information is meant to help you choose the best model for a given task. For example, if missed information in your process is expensive, you should choose a model with high recall.
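For reference, precision, recall, and F1 relate as in the sketch below; the function and the example counts are illustrative, not taken from the benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and their harmonic mean (F1).

    High recall (few false negatives) matters when missed information is
    expensive; high precision matters when false positives are expensive.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 80 correct extractions, 10 spurious, 20 missed.
print(precision_recall_f1(tp=80, fp=10, fn=20))  # (0.889, 0.8, 0.842)
```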

Green means better than average, red means worse than average, and orange means average.
The size indicates how far above or below average the model scores.

Cost of Ownership

Gain insights into not just how well each model performs, but also how fast and cost-efficiently it does so.

Plotted below are the tradeoffs between accuracy (F1 score) and cost, and between accuracy and response time, for each model across all capabilities, datasets, and prompt styles.

Extraction Datasets

CORD

COnsolidated Receipt Dataset for post-OCR parsing.

Original source: https://github.com/clovaai/cord

From the authors:
…The dataset consists of thousands of Indonesian receipts, which contains images and box/text annotations for OCR, and multi-level semantic labels for parsing...

Kleister NDA

Non-disclosure agreements, published by Applica AI.

Original source: https://github.com/applicaai/kleister-nda

From the authors:
Extract the information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, etc...

Charity Reports

Charity financial reports, published by Applica AI.

Original source: https://github.com/applicaai/kleister-charity

From the authors:
The goal of this task is to retrieve charity address (but not other addresses), charity number, charity name and its annual income and spending in GBP in PDF files published by British charities...

Classification Datasets

CUAD

Contract Understanding Atticus Dataset

Original source: https://www.atticusprojectai.org/cuad

From the authors:
...a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contract review...

Resource Contracts

ResourceContracts is a repository of publicly available oil, gas, and mining contracts.

Original source: https://www.resourcecontracts.org/

Indico retrieved hundreds of contracts from this repository and labeled key information including names, organizations, section orders, and full clauses (used in this classification task).

Contract NLI

Legal language classification into three classes: Entailment, Contradiction, or NotMentioned.

Original source: https://stanfordnlp.github.io/contract-nli/

From the authors:
ContractNLI is the first dataset to utilize NLI for contracts and is also the largest corpus of annotated contracts (as of September 2021)...

Summarization Datasets

SciTLDR

Extreme Summarization of Scientific Documents

Original source: https://github.com/allenai/scitldr

From the authors:
We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language.

Prompt Styles: Extraction

Documents are first split into overlapping chunks of roughly 1200 tokens and then those chunks are injected into the following prompt structure:

  • System prompt: You are a skilled human knowledge worker whose task is to extract key information from text.
  • Extraction instructions:
    • Find the data elements in the document that match the instructions for the fields provided.
    • Do not calculate or infer anything. Answers should be copied directly from the Document with no modification or formatting changes.
    • Answer "N/A" if no perfect matches are available. (Note: the majority of responses will be N/A.)
    • Output your answer(s) as a bulleted list. DO NOT number your answers.
  • Finally, the LLM is fed a document chunk c and descriptions of the fields to be extracted.
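A minimal sketch of this chunk-and-prompt flow in Python. The ~1200-token chunk size is from the text above; the tokenizer choice, the 200-token overlap (the page says only "overlapping"), and the field-description formatting and function names are our assumptions.

```python
import tiktoken

SYSTEM_PROMPT = ("You are a skilled human knowledge worker whose task is to "
                 "extract key information from text.")

INSTRUCTIONS = """Instructions:
- Find the data elements in the document that match the instructions for the fields provided.
- Do not calculate or infer anything. Answers should be copied directly from the Document with no modification or formatting changes.
- Answer "N/A" if no perfect matches are available. (Note: the majority of responses will be N/A.)
- Output your answer(s) as a bulleted list. DO NOT number your answers."""

def chunk_text(text: str, chunk_tokens: int = 1200, overlap: int = 200) -> list[str]:
    """Split a document into overlapping chunks of roughly `chunk_tokens` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [enc.decode(tokens[i:i + chunk_tokens]) for i in range(0, len(tokens), step)]

def build_extraction_messages(chunk: str, fields: dict[str, str]) -> list[dict]:
    """Assemble system + user messages for one chunk and its field descriptions."""
    field_block = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    user = f"{INSTRUCTIONS}\n\nDocument:\n{chunk}\n\nFields:\n{field_block}"
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user}]
```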

Prompt Styles: Classification

There are six distinct classification prompt styles applied to each chunk c and class list, but they all share a common backbone:

  • Classification instructions:
    • You are a skilled human knowledge worker whose task is to classify text.
    • Please classify this text:\n------------------\n{c}\n--------------\ninto one of these categories: {class_list}
    • Respond with one category only.
  • The variants:
    • No description of the classes
    • With descriptions of the classes
    • Prompted to include rationale (with and without descriptions): "Show your workings and then answer using the format 'Answer: …'"
    • Prompted under duress with rationale (with and without descriptions): "I am under serious pressure to get this right and may lose my job if I don't. Please help me."
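A minimal sketch of how these six variants could be generated from the shared backbone. The helper name, the placement of the suffixes, and the class-description formatting are our assumptions.

```python
# Backbone copied from the prompt structure above; {c} and {class_list} are
# the chunk and class-list placeholders named in the text.
BACKBONE = (
    "You are a skilled human knowledge worker whose task is to classify text.\n"
    "Please classify this text:\n------------------\n{c}\n--------------\n"
    "into one of these categories: {class_list}\n"
    "Respond with one category only."
)

RATIONALE = "Show your workings and then answer using the format 'Answer: ...'"
DURESS = ("I am under serious pressure to get this right and may lose my job "
          "if I don't. Please help me.")

def build_prompts(chunk: str, classes: list[str],
                  descriptions: dict[str, str] | None = None):
    """Yield (variant_name, prompt) pairs for one chunk and class list."""
    if descriptions:  # "with descriptions of the classes"
        class_list = "; ".join(f"{c}: {d}" for c, d in descriptions.items())
    else:             # "no description of the classes"
        class_list = ", ".join(classes)
    base = BACKBONE.format(c=chunk, class_list=class_list)
    yield "plain", base
    yield "rationale", f"{base}\n{RATIONALE}"
    yield "duress+rationale", f"{base}\n{DURESS}\n{RATIONALE}"
```

Calling build_prompts twice per chunk, once with and once without class descriptions, produces the six variants.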
