In this concluding article on the international scientific panel’s interim report on AI safety, we examine the panel’s proposals for technical mitigants to AI risks. Two weeks ago we reviewed the panel’s assessment of AI capabilities and future trajectories and last week we reviewed the panel’s risk assessment.
Regulating AI is like nailing jelly to the wall
The interim report kicks off its discussion of risk mitigation with an analysis of seven inherent technical characteristics of general-purpose AI which makes mitigation so difficult, if not near-intractable (the report more euphemistically calls these “cross-cutting technical risk factors that each contribute to many general-purpose AI risks”).
1. General-purpose AI systems can be applied in many ways and contexts, making it hard to test and assure their trustworthiness across all possible use cases.
The obvious problem is because general-purpose AI can be utilised in various ways, often beyond the contemplation of the developers, there are challenges with exhaustively identifying and testing the downstream uses. However, there are also other less obvious, but fundamental issues including:
general-purpose AI systems’ outputs are often open-ended, such as free-form dialogue, which makes it difficult to design in advance guardrails which can comprehensively identify and override unacceptable content in all the different versions of user prompts and LLM output;
even carefully designed guardrails can be easily circumvented by users (called ‘jailbreaking'): e.g. manually by a series of prompts which coax LLMs into overriding their safety guardrails or automatically by using one or two small LLMs for adversarial attacks on a larger LLM.
2. General-purpose AI developers have a highly limited understanding of how general-purpose AI models function internally.
Most AI capability comes from what the AI ‘discovers’ for itself from the training data, not from top-down design by humans. Unlike previous generations of technology, “general-purpose AI models do not come with blueprints, and their structure does not conform to common design principles”.
3. Ensuring general-purpose AI systems pursue the goals intended by their developers and users is difcult.General-purpose AI systems can appear to excel at learning what they are ‘told’ to do, but at times can behave in ways - good and bad - which surprise developers:
Although a general-purpose AI model may give the ‘correct’ feedback during training, once deployed it may produce a solution that does not relate to the novel situations which it confronts in the real world - both because the training data does not represent the real world very well and the AI has no way to relate to the real world.
Chatbots, even if originally well-trained, can display ‘sycophantic’ tendencies once deployed to reflect what they learn about the user’s preferences.
AI models may take shortcuts, make decisions that are harmful to humans or ‘grab power’ by commandeering resources, apps and tools where the AI model determines this will advance the ultimate goal to which they are directed (the AI version of 'the ends justifies the means').
4. Because general-purpose AI systems can proliferate rapidly, like other software, a new fault or harmful capability can rapidly have a global and sometimes irreversible impact.
How this plays out can differ between open and closed-source models:
Once an AI model is made available open source, there is no practical way to erase the model from the market if it has unacceptable faults or risks. On the other hand, as so many developers can ‘get under the hood’ of open-source models, problems can be detected and fixed much quicker.
While developers have higher quality control over closed-source AI models, including post-release fixes, they are still rapidly taken up by huge customer bases: ChatGPT set a record for consumer IT take-up with 100 million users in two months of release.
5. Despite attempting to debug and diagnose, developers are not able to prevent overtly harmful behaviours across all circumstances in which general-purpose AI systems are used.
While policymakers have called for guardrails to ban unacceptable behaviour in all circumstances (e.g. implemented through binding hyperparameters which override the parameters set from the training data), the interim report observes that:
general-purpose AI development fails to meet a lower standard than this: ruling out any specic overtly harmful behaviour across foreseeable situations.
6. General-purpose AI agents have early forms of many autonomous capabilities, but lack reliability at performing complex tasks autonomously. Difficult as the above problems are, developers are increasingly focused on building general-purpose AI ‘agents’ which reduce the need for human involvement and oversight.
7. Although current general-purpose AI agents are unreliable, advancement has been rapid. This is a polite way of saying the industry is rushing ahead at full speed in developing advanced capability without a clear solution to the above problems.
If you thought these technical issues made regulation hard, there’s more...
The interim report goes on to argue these technical challenges are compounded by the following societal and market factors.
1. General-purpose AI developers, who are competing for market share in a dynamic market where getting a product out quickly is vital, may have limited incentives to invest in mitigating risks. The interim report says there are two ‘races to the bottom’:
by developers competing “to develop general-purpose AI models as quickly as possible while under-investing in measures to ensure safety and ethics”; and
by countries trying to attract or promote AI investment “through lax regulation that might be insufcient for ensuring safety domestically and abroad.”
2. As general-purpose AI markets advance rapidly, regulatory or enforcement efforts can struggle to keep pace.
3. General-purpose AI systems’ inherent lack of transparency makes legal liability hard to determine, potentially hindering governance and enforcement.
How is liability to be allocated between the multiple parties involved in AI model development - data providers, model trainers, fine tuners and deployers? If we don’t understand how AI really works, how can we trace back to which of these parties is responsible for errors? If AI has “emergent capabilities” of its own creation, who can be held to blame for the problems they cause?
4. It is very difcult to track how general-purpose AI models and systems are trained, deployed and used.
The interim report notes while comprehensive, standardised safety governance is common in safety-critical elds such as automotive, they are missing in AI, domestically and globally.
Are there technical ‘fixes’ to the challenges of AI governance?
With those two mountains to climb, the interim report turns to potential technological solutions. However the interim report does not identify any ‘silver bullets’ and, somewhat dispiritingly, characterises many solutions as not reliable, not well-studied, not universally adhered to, their viability and reliability is limited, may introduce unwanted side effects, nascent and not widely applied yet, or constrained by human error and bias.
The interim report suggests AI developers draw on risk assessment, risk management and design processes used in other safety-critical industries, such as nuclear and aviation. These methodologies put the burden of proof on the developer to demonstrate their product does not exceed the maximum risk thresholds set by the regulator. However, the interim report acknowledges, while useful, there are challenges in translating such measures to AI because:
Quantitative risk assessment methodologies for general-purpose AI are very nascent and it is not yet clear how quantitative safety guarantees could be obtained. Experience of other risk assessments in general-purpose AI suggests that many areas of concern may not be amenable to quantication (for example, bias and misinformation).
The interim report also suggests investment in richer human feedback in the training process could assist in meeting the challenge of ensuring general-purpose AI systems act in accordance with the designers’ intentions and better reflect the real world (called alignment). But this is very resource-intensive and is only as good as the humans involved. Even if these human failings can be mitigated, still as the interim report observes:
...the amount of explicit human oversight used during netuning is very small compared to the trillions of data points used in pre-training on internet data and may, therefore, be unable to fully remove harmful knowledge or capabilities from pre-training.
The interim report also poses the very good question of how can humans oversee the training of AI which has exceeded humans' capabilities? The report notes there is some research, very preliminary, which suggests that “less capable general-purpose AI systems [may be able] to oversee more capable ones under the assumption that similar dynamics may exist for general-purpose AI exceeding human capabilities.” But this raises the question of whether the smarter AI might work out if it is being trained by a dumber AI, and ‘game’ or take control of the lesser AI.
The interim report also favours improving the robustness of AI models through adversarial training which consists rst, of constructing ‘attacks’ designed to make a model act undesirably and second, of training the system to handle these attacks appropriately. But, as the interim report points out, adversarial training requires examples of a failure and “the exponentially large number of possible inputs for general-purpose AI systems makes it intractable to thoroughly search for all types of attacks”.
Adversarial attack learning may make the AI model more vulnerable to types of attacks not included in its learning. One possible solution is to focus the adversarial attacks on the internal workings of the AI model and not outputs (as currently is the case), but that comes back to the problem that we don’t fully know how AI works internally!
Techniques to explain why deployed general-purpose AI systems act the way they do are nascent and not widely applied yet, but the interim report considers that there are some helpful methods. While simply asking AI models for explanations of their decisions tends to produce misleading answers, there is an emerging science of prompt engineering, such as ‘chain of reasoning’ which requires the AI to step through how it got to its answer.
The interim report takes a more sceptical view of the capacity of the following technologies to improve AI safety:‘Machine unlearning’ can be used to remove certain undesirable capabilities from general-purpose AI models, such as how to build a nuclear weapon, but the interim report says “unlearning methods can often fail to perform unlearning robustly and may introduce unwanted side effects”.
Understanding a model’s internal computations might help to investigate whether they have learned trustworthy solutions (called ‘mechanistic interpretability’), but the interim report points out that “research on thoroughly understanding how AI systems operate has been limited to ‘toy’ systems that are much smaller and less capable than general-purpose AI models”.
Mandating technological measures to identify AI-generated content (or conversely human-generated content) could help with misinformation, but the interim report notes that these measures, such as watermarking, are readily circumvented and, given the nature of AI training, can produce many false positives:
...because AI systems tend to memorise examples that appear in their training data, so common text snippets or images of famous objects may be falsely identied as being AI-generated.
As to how to address bias, the interim report observes “[i]t is debated whether general-purpose AI systems can ever be completely ‘fair’”, but there are techniques that can reduce factors such as bias. An AI which is less biased (the teacher) can oversee the training or fine-tuning of another AI (the student) or an AI can be fed both biased and unbiased scenarios to learn the difference between them.
But the interim reports say these kinds of measures can only go so far. AI is not good at making more subtle trade-offs, nuanced understandings or qualitative judgements which inform human fair-mindedness. Sometimes the efforts to teach AI how to reflect or balance competing factors informing fairness can produce ridiculous results. For example, generated images of Indigenous People and women of colour as US senators from the 1800s, and ‘ethnically diverse’ World War II-era German soldiers.
As to privacy, the interim report’s assessment is that “current privacy-enhancing technologies do not scale to large general-purpose AI models". The interim report is also not convinced synthetic data is the answer because to make the value of synthetic data high it “may carry as much information as the original data”.
The interim report suggests it may be more useful for measures to address the lack of data transparency and control used in other areas, such as user-friendly interfaces for managing data permissions, implementing secure data provenance systems to track how data is used and shared, and establishing clear processes for individuals to access, view, correct, and delete their data. The interim report also suggests changing the incentives between developers and users by:
...redistribut[ing] the wealth derived from personal data in a more traceable and equitable manner, for example by using economic tools for data valuation."
Around the world, there are proposed regulatory responses requiring AI systems to be developed and deployed in a manner that respects privacy principles, such as by enforcing data minimisation and purpose limitation. But the interim report comments “yet how to achieve these properties, or the extent to which they are achievable is questionable".
The interim report concludes that there is no real technical solution, and is unlikely to ever be, to the challenge of deep fakes.
Is ‘Swiss Cheese’ the answer?
The interim report concludes “the general-purpose, rapidly evolving, and inscrutable nature of highly capable AI models makes it increasingly difcult to develop, assess, and incentivise systematic risk management practices”. As no individual approach is or is likely to become effective, the interim report suggests the following approach:
Effectively managing the risks of highly capable general-purpose AI systems might therefore require the involvement of multiple stakeholder groups, including experts from multiple domains and impacted communities, to identify and assess high-priority risks. Further, it suggests that no single line of defence should be relied upon. Instead, multiple independent and overlapping layers of defence against those risks may be advisable, such that if one fails, others will still be effective. This is sometimes referred to as the Swiss Cheese model of defence in depth.
But, at the risk of overworking the analogy, will there be more holes than cheese if the capabilities of general-purpose AI continues to rapidly advance?
What’s next?
The panel will consult on the interim report and provide a final report at the next international AI shindig in Paris in February 2025.
Read more: International Scientific Report on the Safety of Advanced AI
Peter Waters
Consultant