5 Things Harder Than Building Your AI Model
As we come to rely more on foundation models as a service, what should the average AI team focus on to make their product stand out from the crowd?
In 2012, a Harvard Business Review article described the role of ‘Data Scientists’ as the “sexiest job of the 21st century”1. In the past decade, that prophecy has rung true, Data Scientists and Machine Learning Engineers have been in high demand and have dominated the discourse around AI development. These roles share a technicality, framing the discussion of AI as inherently practical. This was important because the central dogma of AI projects was, “can you get a model to work?”. Without this expertise you couldn’t make progress.
However, a combination of open source frameworks and super-capable, third-party foundation models provided as a service, mean the focus has shifted away from building from scratch2. The effort required to make a tool ‘work’ is decreasing all the time. So, with less investment required to build models, what should AI teams focus their attention on?
1. Finding the right problem to solve, at the right time
Reduced development costs should empower engineers, product managers and domain experts to focus on their product’s value add. This means getting closer to the problem you are trying to solve and not being blinded by AI hype. “We’ll give them an AI” is not the right approach. Below are 4 key questions to ask before diving into building the solution:
What is the status quo? Are you introducing something new or replacing an existing process delivered by humans? If the latter, how good are the humans, how much do they make errors and how much do they cost? What are their specific value adds? Underestimate the human expert at your peril…
How good is ‘good enough’? The answer to this will require a strong understanding of your domain and the costs of errors. Are you in a low stakes environment where speed trumps high accuracy or are models rarely tolerated?
How hard is it to get there? What is the performance gap between a quick and dirty prototype and a finished product? How much time, effort and money will it require to close that gap?
What is coming down the road? Will the next OpenAI release make building your solution half as difficult? Or worse, will it make your product obsolete?
Many AI products pursue use cases because they seem impressive initially, but underestimate the effort required to reach reliable and usable performance or find the market just never existed. This was evident in medical imaging, where many early products targeted pneumothorax on chest x-rays3 (probably the most life threatening illness you can diagnose on this type of scan). In reality though, a pneumothorax tends to occur in less than 1 in 1,000 scans and even of those, radiologists are impressively adept at spotting them. In this case, a model could only provide value if it reliably never made an error and even then probably wouldn’t be worth much investment by a hospital. It can be easy to get excited about a prototype that could automate a human professional 90% of the time. But the value add of human experts is often in the long tail of the 10% - where AI systems historically struggle.
2. Levelling Up Evaluations
Evaluation and validation has been standard fare for data scientists for a long time. However, too often this focuses on a few standard metrics, with little analysis of the behaviour of an AI system.
What does model behaviour mean? This is more nuanced than performance and includes the trade-offs in a system, between performing well on some groups and worse for others. For example, a system that identifies bank fraud might be optimised to catch large transactions, at the expense of lower value banking clients. Every AI system has trade-offs like this and developers should be wary of showing better performance on a performance metric like accuracy without considering the hidden changes in model behaviour.
This is even more pressing in the age of generative AI, where we are less and less in control of the outputs of our models but instead aim to set up guardrails around them. Evaluations of the next generation of AI systems should look like a combination of field tests and red-teaming.
Field tests because every potential error and harmful output should be considered in the context of its environment. Will a user be able to identify an error and discard it or will they act on good faith? This should feed directly into both product management and UX design, helping prioritise development issues and ensure that systems ‘fail gracefully’4. Some of the most productive time I’ve spent as an ML engineer has been looking over the shoulder of users.
Red teaming should be incorporated into model development as our systems become increasingly sophisticated and relied upon. Deliberately trying to trip models up in a controlled setting can help establish key guardrails for the eventual deployment systems, reducing both deliberate and inadvertent misuse.
3. Monitoring and Iterating
The majority of AI teams already have monitoring pipelines and tools. These will likely feed into dashboards with prediction distributions and response latency. These traditional frameworks are built to identify cohort level issues, such as distribution shift or downtime with the model service. However, they lack the ability to perform deep dives into error cases and perform root cause analysis.
A central problem here is organisational acceptance that models will make errors. This is always the case, and teams should be prepared to encounter edge cases and issues that were not discovered during testing.
Post-deployment testing should be the natural continuation of pre-deployment field testing and red-teaming. Effective systems to collect errors after release, similar to adverse event reporting for the pharma industry5, should be enforced. This also has the added benefit of not allowing large AI providers to hide behind internal testing and empowers users to engage with the improvement of systems.
This is a win-win for development teams, as users can help to gather context and severity of error cases. This helps with the prioritisation for development in response to issues. Knowing how to fix issues when you might not own the source code of your model backbone will be a growing challenge for technical teams and a skill that AI engineers will have to develop.
4. Building Trust
Any efforts to build trust should first be rooted in a deep understanding of users and the specific attitudes and apprehensions they might hold. Carefully designing their interaction with an AI system and the framing of its outputs will go a long way to promoting trust in the outputs.
More technical research around trust (aka alignment) focuses on the issue of explainability, helping users understand why a system made the decision it did. This becomes increasingly difficult as the size and complexity of models grows, making a decision impossible to quickly explain without rendering the system useless. A talkative self-driving system that first explains what it has seen before applying the brakes isn’t going to be very safe.
However, often a level of explainability will be essential for building user trust. It is important for this to be identified early in development processes so that minimum explainability levels can be set and the appropriate model architectures selected for development. It remains to be seen what explainable frameworks will emerge for third-party foundation models, but functions like supplying references for information will be essential.
Outside of technical explainability, trust can be built through careful design of the ways in which users interact with AI systems. Presenting errors openly and honestly, as well as making it clear that a user is interacting with an AI system and establishing appropriate expectations for that interaction.
5. Governance
Governance processes are often neglected until a problem arises or several AI products already exist within a company. However establishing how you intend to build, manage and monitor your AI tools and teams early can help avoid several major challenges in the development process, some of the major benefits of establishing governance early include:
Bringing together scattered AI efforts - It is often difficult to coordinate the activities of multiple research teams and senior managers will rarely have in-depth expertise of the methodologies being implemented. Governance processes can identify opportunities for cross-pollination and resource sharing.
Establishing a position on transparency and explainability - openly stating the minimum (and maximum) requirements of models on these topics gives research teams goalposts to work with and reduces the risk of wasted effort.
Managing project sprawl - development and research efforts can often scale and go down rabbit holes. Avoiding the temptation to implement the newest method or model can sometimes be crucial to delivering ROI from AI development.
Assessing privacy and data risks - bringing together key stakeholders within an organisation offers a chance to reflect on risks of new AI projects and initiatives.
Maintaining compliance - an increasing number of AI initiatives are set to come under regulatory scrutiny in some respect. The interface between compliance and engineering teams can often become a conflict point. Establishing expectations early and developing within an internal set of AI principles can help to avoid blockers down the road.
Wrapping up
AI teams are working on rapidly shifting ground. The functionalities achievable are increasing exponentially with the utilisation of foundation models. At the same time, governments are starting to realise the need for broader regulation to ensure powerful systems are used responsibly.
These shifts mean AI teams will need to encompass a wider range of skills and backgrounds, with less emphasis on technical ability and more focus on end users with AI-savvy product managers, designers and subject experts.
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2799574
https://pair.withgoogle.com/chapter/errors-failing/
https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-public-dashboard