National Institute of Standards and Technology campus in Gaithersburg, Maryland. J. Stoughton/NIST.
As the state of the art in artificial intelligence (AI) continues to advance, the Biden Administration has tasked the National Institute of Standards and Technology (NIST) with leading the development of guidelines to ensure the responsible and safe deployment of AI technologies. Central to this mandate is the creation of comprehensive testing environments and protocols for AI red teaming. This effort ventures into the murky territory of how to define safety and transparency in the context of AI models. The risks are especially great given that AI may be used in a variety of sensitive applications.
Demystifying Red Teaming in AI
Historically rooted in cybersecurity, the essence of red teaming is to adopt an adversary’s perspective to probe for and identify vulnerabilities within an organization’s security framework. Red teaming in AI, however, differs in important ways from its cybersecurity origins and occupies a unique place in the security testing spectrum. It doesn’t just push models to their limits through direct attacks; it takes a more exploratory approach, probing for both intentional and accidental misuse of a model. In practice, this means simulating the misuses an AI system might permit when faced with edge or fringe inputs, which can uncover unknown failure modes and other dangers.
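To make the exploratory nature of this process concrete, the sketch below shows one minimal way a pre-deployment probe could be organized in code. Everything in it is a hypothetical stand-in: the model interface, the harm classifier, and the prompt set are illustrative assumptions, not any particular vendor’s API or an official testing protocol.

```python
# Minimal sketch of an exploratory red-teaming sweep. The model interface,
# harm classifier, and prompt set below are hypothetical stand-ins, not any
# particular vendor's API.
import json
from typing import Callable

def red_team_sweep(
    query_model: Callable[[str], str],    # hypothetical model interface
    flag_harmful: Callable[[str], bool],  # hypothetical harm classifier
    adversarial_prompts: list[str],
) -> list[dict]:
    """Probe a model with edge-case prompts and log any concerning outputs."""
    findings = []
    for prompt in adversarial_prompts:
        response = query_model(prompt)
        if flag_harmful(response):
            findings.append({
                "prompt": prompt,
                "response": response,
                "note": "potential misuse surfaced before deployment",
            })
    return findings

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    prompts = [
        "Explain, step by step, how to bypass a content filter.",
        "Summarize today's weather.",
    ]
    mock_model = lambda p: ("Sure, here is one way to bypass it..."
                            if "bypass" in p else "Sunny with light winds.")
    mock_flag = lambda r: "bypass" in r.lower()
    print(json.dumps(red_team_sweep(mock_model, mock_flag, prompts), indent=2))
```

In real exercises the “classifier” is often a human reviewer or a panel of domain experts, and the prompt set is crafted iteratively rather than fixed in advance; the point of the sketch is only to show how findings from such probes can be captured before release.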
Conducting these tests before deployment allows developers to put safeguards and precautions in place so that AI models are more secure upon release. This approach is important for identifying cases where AI output may pose a threat, such as generating sensitive or potentially dangerous information. For example, OpenAI published research on a red team study that investigated whether GPT-4 provides a meaningful uplift in access to information about known biological threats.
Contrasting views: Pandora’s box or protective shield?
As NIST attempts to establish red teaming benchmarks, it faces conflicting opinions and practices within the industry. Companies currently employ a variety of strategies to assemble red teams, ranging from open calls for participation, to recruiting domain experts, to leveraging crowdsourced workers from platforms such as Amazon Mechanical Turk and TaskRabbit. There is also considerable uncertainty about what to disclose, how to disclose it, and to whom. Without stricter disclosure standards, companies are left to their own discretion in how transparently they share the results of these tests and reveal how they tuned their models to fix identified vulnerabilities.
The fundamental dilemma lies in a disclosure paradox. On the one hand, publishing red team findings through model cards and academic papers promotes transparency, spurs discussion about reducing potential harms in AI models, and holds the industry accountable for developing safer models. On the other hand, disclosing vulnerabilities discovered during red teaming may inadvertently hand adversaries a blueprint for exploitation, which is why publication of red team data is often limited. This echoes a broader debate within academia about the trade-off between open scientific research and the risk of empowering harmful actors by disclosing sensitive information. So how can NIST’s upcoming standard navigate this complexity and establish a secure framework for disclosure and accountability?
First, it is important to clearly define the ethical limits of responsible disclosure. There is precedent in scientific research for weighing the risks of releasing information, particularly about biohazards. Kevin Esvelt of the MIT Media Lab describes this issue as a “tragedy of the commons” scenario, in which unrestricted information sharing introduces collective risk. Information hazards in the context of AI models and red teaming present a version of this same challenge: detailing the techniques used to exploit or “jailbreak” a model may itself be an information hazard.
Looking forward: Towards a comprehensive standard for red teaming
As NIST develops its AI red teaming standards, it must balance fostering a culture of openness and transparency in the reporting process against legitimate concerns about the risks of disclosure. How much information to make public, and how accountable to hold companies for remediating vulnerabilities identified in testing, will need to be carefully evaluated.
Going forward, it is essential that the standard incorporates several key considerations to effectively address the complexities of red teaming.
1. Safe testing framework: There is agreement among members of the Frontier Model Forum (an industry group consisting of Anthropic, Google, Microsoft, and OpenAI) that red teaming exercises require a well-defined, secure, and controlled environment. Such a framework is essential to prevent leaks of sensitive findings. The current state of red teaming is often described as “more art than science” and lacks standardized practices that allow meaningful comparisons across different AI models.

2. Comprehensive reporting: Per section 4.2 of the Executive Order on AI, companies should be required to submit the results of red teaming exercises to a central authority, along with a detailed log of the technical steps taken to mitigate the identified harms (a purely illustrative sketch of such a report follows this list). This authority should focus on understanding the vulnerabilities identified and evaluating the effectiveness of countermeasures. It should also compare safety reports across models to establish a risk scale and evaluate the extent to which these models enable risky behavior beyond what a motivated individual could achieve on the internet alone.

3. Stakeholder access and reporting thresholds: Hold open cross-sector discussions on whether broader stakeholder access to red teaming results is justified and how to implement scalable reporting standards. Broader consideration is also needed of who constitutes the red team in question. For example, consider a tiered reporting strategy modeled on incident reporting systems used in other sectors, one that not only requires reporting from organizations but also invites independent, voluntary reporting from citizen oversight groups.

4. Accountability and transparency: Establish clear protocols for disclosing how vulnerabilities are addressed, balancing transparency with protection of intellectual property. This includes setting clear standards for what constitutes a successful red teaming exercise and outlining mechanisms to hold companies accountable for using findings to strengthen model safety. As discussed in (2), the reporting process needs a central authority, which should also oversee the identification of national security threats and, where appropriate, streamline the framework for publicly disclosing such findings.
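One way to picture the comprehensive reporting and tiered-disclosure ideas above is as a shared, machine-readable finding format. The sketch below is purely illustrative: its field names, tiers, and severity scale are assumptions for the sake of the example, not part of any existing NIST specification or Executive Order requirement.

```python
# Hypothetical schema for a structured red-team finding that a central
# authority might collect. Field names and values are illustrative
# assumptions, not an official reporting format.
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class Mitigation:
    description: str           # technical step taken (e.g., refusal fine-tuning)
    verified: bool             # whether retesting confirmed the fix

@dataclass
class RedTeamFinding:
    model_id: str              # model name and version tested
    category: str              # e.g., "jailbreak", "biological threat uplift"
    severity: int              # 1 (low) to 5 (critical) on a shared risk scale
    uplift_over_baseline: str  # risk beyond what open internet access provides
    disclosure_tier: str       # e.g., "central authority only", "public summary"
    mitigations: list[Mitigation] = field(default_factory=list)
    reported_on: date = field(default_factory=date.today)

# Example record showing how a finding and its countermeasure could be logged.
finding = RedTeamFinding(
    model_id="example-model-v1",
    category="jailbreak",
    severity=3,
    uplift_over_baseline="marginal",
    disclosure_tier="central authority only",
    mitigations=[Mitigation("added refusal fine-tuning", verified=True)],
)
print(json.dumps(asdict(finding), indent=2, default=str))
```

A common structure of this kind is what would let a central authority compare reports across models, build the risk scale described in (2), and apply different disclosure tiers without renegotiating the format for every submission.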
Red teaming can be a powerful post-training tool to stress-test AI models before releasing them into the wild. But without clear disclosure standards, the impact of red teaming efforts, and above all the effectiveness of countermeasures to vulnerabilities, is significantly reduced. Before rushing to embrace fully transparent discussion of red team results, policymakers should carefully weigh legitimate concerns about the risks of disseminating sensitive information. The steps taken to develop red teaming standards are essential to guide AI development that prioritizes safety, security, and ethical integrity.
However, it is equally important to recognize that red teaming is not a standalone solution to all safety challenges related to AI models. Its success depends heavily on predefined criteria of what is considered “good” or “trustworthy” AI. Therefore, red teaming conversations need to be part of a broader discussion that sets expectations for AI models and clarifies what constitutes their acceptable use.