Large Language Model Evaluation Metrics
The most common evaluation metrics for large language models are:
Perplexity
BLEU
ROUGE
BERTScore
COMET
METEOR
BLEURT
GPTScore
PRISM
BARTScore
G-Eval
Human Evaluation
Evaluating large language models (LLMs) is difficult because they can perform a wide range of tasks, and no single metric covers all of them.
Perplexity
Perplexity measures how well a model predicts the next word. Lower is better: a high score means the model is poor at predicting what comes next, so the objective is to minimize the language model's perplexity. To perplex is to baffle or confuse, so a model that is good at predicting the next token is not baffled or confused.
Perplexity is better suited to auto-regressive models that generate text than to masked language models such as BERT, which are typically used for classification. The metric is computed as the exponentiated average negative log-likelihood of a sequence. Because perplexity measures how well the model predicts the next token, the tokenization scheme also affects the score, so perplexities from models with different tokenizers are not directly comparable.
The perplexity of a large language model is closely related to cross-entropy, a common loss function in classification problems. Predicting the next word given the preceding words can be treated as a classification problem over the vocabulary, so the cross-entropy loss applies directly. Taking the exponential of that loss gives the perplexity.
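The relationship between cross-entropy and perplexity can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `perplexity` and the probability values are hypothetical, and a real evaluation would take the per-token probabilities from the model being measured.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Cross-entropy: the average negative log-likelihood per token (natural log).
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of the cross-entropy.
    return math.exp(cross_entropy)

# Hypothetical probabilities the model gave to the correct next token.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print("confident model:", perplexity(confident))
print("uncertain model:", perplexity(uncertain))
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, which is often read as the model being as unsure as if it were choosing uniformly among four options.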
ROUGE
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures the quality of a summary or translation. As the name suggests, the metric is based on recall. It was designed for assessing summaries but can also be applied to translated text.
ROUGE compares the translated or summarized text with human-produced reference text by measuring the overlap between the sequences written by humans and those generated by the language model.
Specifically, ROUGE-N measures the overlap in n-grams, ROUGE-L the longest common subsequence, and ROUGE-S the overlap in skip-bigrams.
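To make the variants concrete, here is a small sketch of the recall side of ROUGE-N and ROUGE-L. It assumes whitespace tokenization and a single reference, and the function names are made up for illustration. ROUGE-S is left out, and published implementations also report precision and F-measure.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    if not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    return lcs_length(candidate.split(), ref_tokens) / len(ref_tokens)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1 recall:", rouge_n_recall(candidate, reference, 1))
print("ROUGE-2 recall:", rouge_n_recall(candidate, reference, 2))
print("ROUGE-L recall:", rouge_l_recall(candidate, reference))
```

The single changed word costs one unigram but two bigrams, which is why ROUGE-2 drops more sharply than ROUGE-1 in this example.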
BLEU
Bilingual Evaluation Understudy (BLEU) is used to evaluate the quality of machine-translated text. The closer the language model's output is to a human translation, the higher the quality. The BLEU formula includes a brevity penalty because n-gram precision alone tends to give higher scores to shorter predictions.
The score is a number between 0 and 1, computed by comparing the translated text with a set of high-quality human translations. A score of 0 means there is no overlap between the machine-translated text and the reference text, and 1 means there is perfect overlap. A limitation of BLEU is that it relies on n-gram overlap, so a valid translation that uses different wording can score poorly.
Because BLEU is more reliable over an entire corpus, it is a poor fit for scoring individual sentences. It does not capture grammaticality or meaning. BLEU is also affected by the choice of normalization and tokenization methods.
BLEU is mainly concerned with n-gram precision, where a unigram is a single word, a bigram is a pair of consecutive words, and so on. Precision is computed for each n-gram order and the results are combined with a geometric mean, without taking recall into consideration. Since BLEU doesn't factor in recall, it adds a brevity penalty to compensate.
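The pieces described above (clipped n-gram precision, a geometric mean over n-gram orders, and the brevity penalty) fit together roughly as follows. This is a simplified sketch that assumes whitespace tokenization, a single reference, and no smoothing. The function names are illustrative. As noted above, BLEU is intended for whole corpora, so sentence-level numbers like these are only useful for seeing how the formula behaves.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU for one sentence pair (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        clipped = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Geometric mean of the per-order precisions.
    geo_mean = math.exp(sum(log_precisions) / max_n)
    # Brevity penalty: 1 when the candidate is longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

reference = "the cat sat on the mat"
print("identical:", bleu(reference, reference))
print("truncated:", bleu("the cat sat on", reference))
print("unrelated:", bleu("a dog ran in a park", reference))
```

The truncated candidate has perfect precision at every n-gram order, so only the brevity penalty keeps it from scoring the same as the full sentence.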
BERTScore
BERTScore is a metric built on pre-trained BERT embeddings and used for evaluating summaries and translations. The authors of the metric also tested it on image captioning. BERTScore reports precision, recall, and F1 measures, which can be used for evaluating text generated by language models.
BERTScore calculates a similarity score for each token in the generated sentence against the tokens of a reference sentence using contextual embeddings. The computation is a sum of cosine similarities between the token embeddings of the two sentences, with each token matched to its most similar counterpart. BERTScore has been shown to correlate well with human evaluations.
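The matching step can be illustrated without loading a model. The sketch below uses made-up two-dimensional vectors in place of real contextual BERT embeddings and averages each token's best cosine match to get precision, recall, and F1. The function names and vectors are assumptions for illustration, and refinements of the published metric, such as token weighting and score rescaling, are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_score(candidate_vecs, reference_vecs):
    """Precision, recall, and F1 from best-match cosine similarities."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "embeddings", one vector per token. Real vectors would come from BERT.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_score(candidate_vecs, reference_vecs)
print("precision:", precision, "recall:", recall, "F1:", f1)
```

Because the similarity is computed on embeddings rather than surface strings, a candidate token can earn partial credit for being close to a reference token without matching it exactly.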
COMET
Like BERTScore, COMET (Crosslingual Optimized Metric for Evaluation of Translation) uses a pre-trained language model to generate scores for candidate sentences, so the metric does not depend solely on n-grams. The use of language models enables the metric to benefit from the linguistic capacity the model has learned from its large training corpus. However, relying on language models can be a problem since some of them may be insensitive to sentence polarity and word order.
COMET is a set of models that can be used for evaluating machine translation. You can also train custom models. The models give a score between 0 and 1, with a score closer to 1 indicating better translation quality. COMET is a PyTorch-based framework for training evaluation models that predict human judgments such as Direct Assessments (DA), Human-mediated Translation Edit Rate (HTER), and metrics compliant with the Multidimensional Quality Metrics (MQM) framework.
METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric that requires training and relies on token alignment. METEOR evaluates machine-translated text against a human reference by unigram matching. The matching is scored using unigram precision and unigram recall.
Unlike BLEU, METEOR uses recall in its computation, measuring how many of the unigrams in the human translation are matched. METEOR works by creating an alignment between the machine-translated text and the human reference text. The alignment is a mapping between the two strings such that each unigram (a single word) in one string maps to zero or one unigram in the other string and to no unigram in the same string.
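The alignment constraint and the recall-weighted score can be sketched as follows. This is a deliberately reduced version: it matches exact word forms only, builds the alignment greedily from left to right, and omits the stemming, synonym matching, and fragmentation penalty that full METEOR applies. The function names and the default recall weight of 9 (which gives the harmonic mean 10PR / (R + 9P)) are assumptions chosen for illustration.

```python
def align_unigrams(candidate, reference):
    """Greedy one-to-one exact-match alignment as (candidate_idx, reference_idx) pairs."""
    used_ref = set()
    alignment = []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            # Each reference unigram may be matched at most once.
            if j not in used_ref and word == ref_word:
                alignment.append((i, j))
                used_ref.add(j)
                break
    return alignment

def unigram_f_mean(candidate, reference, recall_weight=9.0):
    """Recall-weighted harmonic mean of unigram precision and recall."""
    cand, ref = candidate.split(), reference.split()
    matches = len(align_unigrams(cand, ref))
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # With recall_weight = 9 this equals 10 * P * R / (R + 9 * P).
    return (1 + recall_weight) * precision * recall / (recall + recall_weight * precision)

reference = "the cat sat on the mat"
print(align_unigrams("the the".split(), "the".split()))
print("short candidate:", unigram_f_mean("the cat sat", reference))
print("full candidate:", unigram_f_mean(reference, reference))
```

The short candidate has perfect precision but only half the recall, and the heavy recall weighting pulls its score down to roughly half, which BLEU's precision terms alone would not do.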
GPTScore
GPTScore uses generative pre-trained models to score the generated text. The evaluation framework for GPTScore is shown in the following image.
The method uses a language model to evaluate a given text according to an evaluation protocol. The protocol is defined from the task and the aspect being evaluated, for example how easy the generated response is to understand. As the metric's name suggests, the language models used include GPT-2 and GPT-3, chosen for their strong zero-shot abilities. Given an instruction and context, the language model assigns a higher score to higher-quality text.
The GPTScore is defined as shown in the previous figure where:
w_t is the weight of the token at position t
T(·) is a task-dependent prompt template defining the evaluation protocol
d is a task description
a is the aspect definition
h is the evaluation text
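The figure is not reproduced in this text, so the following is only a minimal sketch. It assumes that the score is a weighted sum of the log-probabilities the scoring model assigns to each token of h, conditioned on the prompt that T(·) builds from d and a. The function name `gpt_score`, the uniform default weights, and the log-probability values are all hypothetical. In practice the log-probabilities would be read from the scoring model's output.

```python
import math

def gpt_score(token_logprobs, weights=None):
    """Weighted sum of per-token log-probabilities for the evaluated text h."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if weights is None:
        # Assumption: every token gets the same weight w_t = 1 / len(h).
        weights = [1.0 / len(token_logprobs)] * len(token_logprobs)
    if len(weights) != len(token_logprobs):
        raise ValueError("weights and token_logprobs must have the same length")
    return sum(w * lp for w, lp in zip(weights, token_logprobs))

# Hypothetical log-probabilities of each token of h given the prompt.
fluent_text = [math.log(p) for p in (0.6, 0.7, 0.5, 0.8)]
awkward_text = [math.log(p) for p in (0.1, 0.2, 0.05, 0.3)]

print("fluent:", gpt_score(fluent_text))
print("awkward:", gpt_score(awkward_text))
```

Scores computed this way are negative, and values closer to zero mean the model found the text more likely under the given instruction.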
Human Evaluation
Human evaluation involves getting human evaluators to assess the output of language models for:
Quality
Fluency
Coherence
Relevance
This is a very good method for evaluating a language model, but it is subject to human biases and opinions.
Final Thoughts
Using language models to automate the evaluation of large language models is a big step forward, but the results can be skewed by biases inherent in the models' training corpora. Metrics that don't rely on language models are not bulletproof either. The best choice depends on the use case and may call for a hybrid approach.
It may be best to use several metrics and compare how the language model performs on them. It is also important to consider the tokenization methods used because they also affect how the model performs on the different metrics.