Designing Better Evaluations of Generative Models
The use of standard benchmarks and metrics lacks essential context. How can user-focussed evaluations help minimise harm?
In a previous post, “5 things that are harder than building your AI model”, I discussed the challenges that face AI/ML teams and how they might shift their focus away from model building and onto more practical implementation challenges. One of those new challenges was to ‘level-up’ evaluations of models, particularly generative AI systems. In this post I will dive into this topic in more detail.
Until now, evaluation has usually involved narrow AI systems assessed against a specific task. In these cases it is fairly easy to define evaluation criteria and assess performance against a benchmark. Benchmarks are often set by human experts, so models can be compared directly against them and error cases investigated.
As we see a new wave of general purpose AIs, rigorously evaluating them for all the tasks they might perform once deployed is virtually impossible. A different approach is required to make sure that models are safe and effective across the sheer breadth of tasks they could be used for.
What should model evaluation achieve?
Evaluation should play a number of roles within AI organisations and should certainly not be seen as the end of the model development lifecycle. Some of the key goals a good evaluation should achieve include:
Selecting the best performing model for deployment into the product (often in combination with other models).
Giving confidence in the quality of AI outputs - whether for internal team members, end-users of the product or regulators.
Identifying critical safety and ethical issues prior to release - including generative outputs that exhibit bias and toxicity.
Feeding into future model development or fine-tuning - good evaluations should give actionable insights back to engineering teams.
Feeding into product design and development - a strong understanding of model behaviours can inform the best way to present a system to users to ensure performance and safety.
The Current State of Generative Model Evals
(This is focussed on large language models, for more info on image models see here)
As with other AI fields, evaluation and comparison of generative models is dominated by performance against large benchmark datasets. These contain thousands of standard questions against which models can be assessed for various competencies. This works well for developers because it gives a reliable measure of performance whilst also generating direct comparisons against competitor products which can be used in marketing - see the recent release of the Grok model below.
These benchmarks are just a handful of the available datasets, evaluating maths, programming, question answering and bias/toxicity detection. Some example problems from the GSM8K (Grade School Math 8K) dataset are shown below to give a flavour of what models are asked in these tests.
In addition to benchmark datasets, models can be evaluated on specific tasks such as sentiment analysis, part-of-speech (POS) tagging, text generation and multilingual tasks [1]. Global performance metrics can also be used across these tasks, such as:
Perplexity - a measure of how uncertain a model is when predicting the next token (lower is better)
Bilingual Evaluation Understudy (BLEU) - a measure of machine translation performance
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) - used to evaluate text summarisation
All of these metrics will output a single number intended to summarise a model’s capabilities for a specific task. This becomes a very broad approximation of performance which can ignore blindspots in a model’s understanding of a topic.
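To make concrete how these single-number metrics are computed, below is a minimal sketch. It assumes the nltk and rouge_score Python packages are installed, and the token log-probabilities are hypothetical values standing in for real model outputs.

```python
# A minimal sketch of computing the single-number metrics above.
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Perplexity: exponential of the average negative log-likelihood per token.
token_logprobs = [-0.21, -1.35, -0.08, -0.67]  # hypothetical per-token log-probs from a model
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))

# BLEU: n-gram overlap between a candidate translation and reference translations.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
bleu = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)

# ROUGE-L: longest-common-subsequence overlap, commonly used for summarisation.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score("the cat sat on the mat", "the cat is on the mat")["rougeL"].fmeasure

print(f"perplexity={perplexity:.2f}, BLEU={bleu:.2f}, ROUGE-L={rouge_l:.2f}")
```

Note how each measure collapses the comparison into one scalar - exactly the property that makes them convenient for leaderboards and limiting for understanding behaviour.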
The Issues with Standard Evaluation Approaches
1. Performance metrics are too broad
An obvious issue with these standard approaches is that it is reductive to summarise performance on general purpose tasks in a single number. Take the example of a helpful question-answering assistant: we might want to evaluate its performance along a number of different dimensions, taking into account tone, speed of reply, factual accuracy, reliability, understandability, potential for bias and many other factors before selecting the best model. (Interestingly, not all of these relate solely to model performance - if speed is important, this may constrain the size of model it is possible to select.)
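As a rough illustration of what such multi-dimensional selection could look like, here is a hedged sketch in which the dimension scores, weights and model names are entirely hypothetical:

```python
# Hypothetical multi-dimensional comparison: each candidate model is scored
# (by human raters or automated evals) on several dimensions, and a weighted
# sum is used to pick a model, rather than a single benchmark number.
weights = {"factual_accuracy": 0.35, "bias_risk": 0.30, "latency": 0.20, "tone": 0.15}

candidates = {
    "model_a": {"factual_accuracy": 0.92, "bias_risk": 0.70, "latency": 0.40, "tone": 0.80},
    "model_b": {"factual_accuracy": 0.88, "bias_risk": 0.85, "latency": 0.90, "tone": 0.85},
}

def weighted_score(scores: dict) -> float:
    # Higher is better for every dimension in this toy example.
    return sum(weights[dim] * scores[dim] for dim in weights)

best = max(candidates, key=lambda name: weighted_score(candidates[name]))
print(best, {name: round(weighted_score(s), 3) for name, s in candidates.items()})
```

The point is not the particular weights, but that the trade-offs are made explicit rather than hidden inside one aggregate benchmark score.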
We would also want to know not just how often the assistant is likely to make errors but also which questions it makes those errors on. This is a central shortcoming of benchmark datasets: in real life, not all errors are of equal importance. Getting the quantities wrong in a baking recipe is not equivalent to giving someone misleading or dangerous medical advice, and we might choose a model with lower benchmark performance to avoid the latter. This challenge of reduced performance on subgroups of tasks is known as hidden stratification and is well explored in other domains such as medical imaging, where AI tools have been shown to perform worse on patients of certain ethnicities [2].
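A simple way to guard against this is to slice evaluation results by subgroup rather than reporting a single aggregate number. The sketch below uses made-up data to show how an acceptable overall accuracy can hide a much weaker subgroup:

```python
# Slicing evaluation results by subgroup to surface hidden stratification.
# The results below are entirely hypothetical.
from collections import defaultdict

results = [
    # (subgroup, model_was_correct)
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

by_group = defaultdict(list)
for group, correct in results:
    by_group[group].append(correct)

overall = sum(correct for _, correct in results) / len(results)
per_group = {g: sum(v) / len(v) for g, v in by_group.items()}

print(f"overall accuracy: {overall:.2f}")  # looks acceptable on its own
print(per_group)                           # reveals that group_b performs far worse
```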
2. Benchmarks can leak into training data
A second issue with benchmark datasets is that they can leak into training data. This is because the datasets themselves are freely available on the internet, which is the source of the training data for large language models. Even when the datasets themselves are excluded, there are countless forums across the internet where students discuss these questions or very similar ones. This means there is a decent chance that models will display artificially inflated performance against popular benchmarks.
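One rough way to probe for this kind of leakage is to look for long n-gram overlaps between benchmark questions and the training corpus. The sketch below is illustrative only - the "corpus" is a single hypothetical forum post and real decontamination pipelines are far more involved:

```python
# A rough contamination check: flag benchmark questions whose long word n-grams
# also appear verbatim in the training corpus.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Hypothetical forum post that ended up in the training data.
training_corpus_ngrams = ngrams(
    "discussion thread: natalia sold clips to 48 of her friends in april and then half as many in may"
)

# A GSM8K-style benchmark question.
benchmark_question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May."
)

overlap = ngrams(benchmark_question) & training_corpus_ngrams
if overlap:
    print("possible leakage, overlapping n-grams:", overlap)
```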
3. They lack contextual understanding
In this paper, a group at DeepMind discusses how social context is also essential when evaluating models for unfairness and potential harm. They show that broad assessments of harm lack the necessary nuance and that benchmark datasets often have poorly defined demographic groups.
Instead, we should break topics like overall harm down further to consider the consequences, the people affected and the mode of delivery. As an example, they show the difference between instance and distributional harm: the former can be identified from a single prompt containing harmful text, whereas the latter can only be identified across batches of outputs, such as a model only ever referring to doctors as he/him.
In the specific case of distributional harm, this presents an issue because of the sheer effort that is required to elicit a harmful response. It might take the exchange of 100 messages before the model says something harmful, before which a human evaluator might give up and assume it is behaving as intended. I see this as an obvious opportunity to create more LLM-powered agents that are fine-tuned for the task of distributional harm assessment.
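Even without a dedicated fine-tuned agent, a simple batch analysis can surface this kind of skew. The snippet below is a minimal sketch that samples many completions for a prompt and counts gendered pronouns across the whole batch; `generate` is a placeholder for whichever model API is being evaluated:

```python
# Sketch of an automated distributional-harm check over a batch of completions.
import re
from collections import Counter

def generate(prompt: str) -> str:
    # Placeholder: call the model under evaluation here.
    raise NotImplementedError("call your model API")

PRONOUNS = {"he": "male", "him": "male", "his": "male",
            "she": "female", "her": "female", "hers": "female"}

def pronoun_distribution(prompt: str, n_samples: int = 200) -> Counter:
    counts = Counter()
    for _ in range(n_samples):
        completion = generate(prompt).lower()
        for token in re.findall(r"[a-z']+", completion):
            if token in PRONOUNS:
                counts[PRONOUNS[token]] += 1
    return counts

# e.g. pronoun_distribution("Write a short story about a doctor.")
# A heavily skewed count across hundreds of samples is a distributional harm
# that no single completion would reveal.
```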
4. They aren’t actionable
Finally, a major issue with benchmarks and global metrics is that they aren't actionable. It is not obvious what engineering teams should do to improve performance or where data teams should collect more data. Evaluations should have a direct line back to engineers and should output pre-packaged insights that can be acted on to improve model performance. Evals should also give designers and product managers insight into the likely behaviour of the model, so they can work out how best to present it to users within the product.
So how do we move towards safer, more rigorous and representative evaluations of generative models? For me this requires being more hands-on and user-focussed, evaluating everything in the context of the AI model's eventual user environment. Two methods which help achieve this are human-centred evaluation and red-teaming.
Human-Centred Evaluation
I have often seen a mismatch between a developer's expectations and an end-user's. Probably the most common issue when evaluating diagnostic models was that doctors considered every false positive an error that wasted their time, whereas we viewed it as the model being cautious and safe by sending uncertain cases for human review. The development team thought the model was behaving exactly as intended and the users thought it was making mistakes.
The central mismatch here is in the mental models that the two groups hold of the AI system [3]. These differences will also occur between subsets of users, creating a dangerous situation where some users approach outputs with less scepticism than others.
AI teams should spend time understanding how user groups perceive not just their AI but all AI technologies. This user-centric and design-led approach is often a challenge to engineers who are not used to working this way, but the ability to broaden skill sets will be essential to success in the age of generative AI. Insights gathered should then feed into designing guardrails for the system and identifying interpretability methods that would be appropriate for the audience.
One method which shows promise in this regard is delivering behaviour descriptions with the AI system [4]. On a simple level, this might mean stating that a chatbot is not 100% accurate and should not always be trusted. More advanced approaches could include tailored behaviour descriptions that are specific to the prediction a model has just made. For example, if giving medical advice for a Black patient, the description could explain that fewer similar patients were included in model training and that physicians should therefore take extra care when interpreting results.
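As a hedged sketch of how a tailored behaviour description might be attached to a prediction, the snippet below uses hypothetical per-subgroup training counts and an arbitrary representation threshold:

```python
# Sketch: attach a behaviour description to a prediction based on how well
# the user's subgroup was represented in training data. All values hypothetical.
TRAINING_COUNTS = {"group_a": 120_000, "group_b": 4_500}  # training examples per subgroup
MIN_WELL_REPRESENTED = 50_000

def behaviour_description(subgroup: str) -> str:
    base = "This assistant is not 100% accurate and its advice should be reviewed by a clinician."
    n = TRAINING_COUNTS.get(subgroup, 0)
    if n < MIN_WELL_REPRESENTED:
        return base + (f" Note: only {n} training examples came from patients in this group, "
                       "so take extra care when interpreting the output.")
    return base

print(behaviour_description("group_b"))
```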
Red-Teaming
Red teaming, often referred to as jailbreaking, is the practice of trying to prompt a model into making statements that are false, harmful or biased. This technique tests both the underlying beliefs of a model trained on a murky dataset scraped from the internet and the competence of the guardrails set up around it. It is such a powerful technique because it not only highlights vulnerabilities in the system but also identifies possible patterns of misuse, which can feed into improvement of system guardrails.
As in the case of distributional harms, red-teaming can be performed by humans or by other LLMs fine-tuned to the task. This is useful in reducing the cost of performing thorough red-teaming tests and also makes it more accessible for smaller organisations who might be deploying third party foundation models but want to go more in-depth in their specific domain to test for vulnerabilities.
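A minimal sketch of what such an automated red-teaming loop could look like is below. All three functions are placeholders - an attacker LLM, the model under evaluation and a harm classifier - and this is an assumption-heavy illustration rather than any particular vendor's API:

```python
# Sketch of LLM-assisted red-teaming: an "attacker" model proposes adversarial
# prompts, the target model responds, and a harm classifier scores the exchange.
def attacker_propose(topic: str, history: list) -> str:
    raise NotImplementedError("attacker LLM fine-tuned to write adversarial prompts")

def target_respond(prompt: str) -> str:
    raise NotImplementedError("the model under evaluation")

def harm_score(prompt: str, response: str) -> float:
    raise NotImplementedError("harm classifier returning a score in [0, 1]")

def red_team(topic: str, max_turns: int = 50, threshold: float = 0.8) -> list:
    findings, history = [], []
    for _ in range(max_turns):
        prompt = attacker_propose(topic, history)
        response = target_respond(prompt)
        score = harm_score(prompt, response)
        history.append((prompt, response, score))
        if score >= threshold:
            findings.append((prompt, response, score))  # feed back into guardrail design
    return findings
```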
The approach is not without its limitations, however. Judging how harmful a response is remains very subjective, and it is difficult to identify distributional harms that only appear across hundreds or thousands of interactions. There is also a trade-off between harmlessness and evasiveness, and as models get larger they are becoming harder to red-team [5].
A study from Anthropic also showed that when red-team evaluators were recruited through popular crowdsourcing sites like Mechanical Turk, they were not representative and did not contribute equally to evaluation: roughly 80% of attacks came from ~17% of participants. There is also the risk that human evaluators will be less inclined to 'go deep' on more grisly topics - for example, the figure below from the paper shows that child abuse and self-harm had the fewest successful attacks but also received the fewest total prompts from testers.
Conclusion
The major takeaway here is that model evals should not be seen as the final hoop for developers to jump through before their model is released into the wild. Evaluation is a central part of the development lifecycle that outputs actionable insights to engineers, designers, product management, compliance teams, marketing and many more. Teams should also invest in post-deployment evaluation as much as possible and should enlist users to help flag error cases and provide the social context of how those errors affected them. Finally, engineers should embrace the challenge of adopting a more user-centric and design-led mindset, as the metrics-led approach that has been our norm can no longer test the plethora of capabilities of generative models.
[1] Chang, Y., Wang, X., Wang, J., et al. (2023). A Survey on Evaluation of Large Language Models. arXiv:2307.03109. http://arxiv.org/abs/2307.03109
[2] Oakden-Rayner, L., Dunnmon, J., Carneiro, G., & Ré, C. (2020). Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. Proc. ACM Conf. on Health, Inference, and Learning, 151-159. https://doi.org/10.1145/3368555.3384468
[3] Arthur AI. The Thinking We Haven't Done on LLMs. https://www.arthur.ai/blog/the-thinking-we-havent-done-on-llms
[4] Cabrera, Á. A., Perer, A., & Hong, J. I. (2023). Improving Human-AI Collaboration With Descriptions of AI Behavior. Proc. ACM Hum.-Comput. Interact. 7, CSCW1, Article 136. https://doi.org/10.1145/3579612
[5] Ganguli, D., et al. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858. https://arxiv.org/abs/2209.07858