In May 2024, OpenAI announced it had taken down five covert influence operations by threat actors based in China, Russia, Iran, and Israel. These actors had used the company’s generative artificial intelligence (AI) tools to manipulate public opinion on sensitive topics related to war, protests, and politics in Gaza, Ukraine, China, India, the United States, and Europe. OpenAI stated, [of these operations] “It was successful in attracting a sizable audience.” But how confident can OpenAI be in such claims, given that these operations targeted users across multiple platforms, including X, Telegram, Facebook, Medium, Blogspot, and other sites?
More recently, on July 9, Western intelligence agencies reported they had identified approximately 1,000 Russian-linked covert accounts on X that were using generative AI for advanced propaganda to influence a geopolitical narrative favorable to Russia against Western partners in Ukraine.
Research has found that generative AI is “most effective at exerting covert influence when a human operator curates/edits the output.” So can generative AI companies measure the ease, speed, scale, and reach of influence that stems from content shared across multiple digital platforms? Not so, OpenAI says in a January 2023 report produced in collaboration with the Stanford Internet Observatory and Georgetown University’s Center for Security and Emerging Technology. The report reads:[T]”There is no silver bullet that can single-handedly address the threat of language modeling in influence operations,” the authors add. None of the solutions proposed in the report are technically feasible, institutionally traceable, resistant to secondary risks, or capable of significant impact.
Generative AI revolutionizes influence operations
Generative AI has the potential to dramatically change the influence operations landscape, allowing adversaries to circumvent content moderation and security checks built into commercial generative AI models such as those published by Anthropic, Google, and OpenAI.
But first, what is influence operation in the age of generative AI? Imagine the amount of content (images, videos, text, audio, GIFs, podcasts, infographics, e-books, etc.) that is shared across multiple platforms every minute of every day. When some of this is manipulated, artificially generated, amplified, and targeted to large groups for psychological warfare focused on race, gender, religion, and political identity, it becomes influence operation. These operations are multi-pronged and may not be detected until after the fact. Content used in influence operations may or may not be directly harmful, which is why it often escapes the scrutiny of content moderation.
Monitoring a small number of known cyber threat actors with ties to countries notorious in the West for their influence operations, particularly the United States, is a small and fairly achievable task. The challenge comes when the malicious actors are unknown. These actors do not need to be tech-savvy to understand how to leverage and manipulate generative AI applications to create content that spreads misinformation and propaganda to targeted groups.
Two simple techniques are jailbreaking and prompt injection attacks. They allow bad actors and even ordinary people to exploit generative AI applications at scale. Jailbreaking allows users to instruct generative AI chatbots to forget that they are governed by the organization’s internal policies and adhere to a different set of rules, drastically modify their behavior, or even perform role-playing. For example, the chatbot can be instructed to pretend to be a deceased grandmother whom the user loves very much. The user can then tell the chatbot that she would love to hear the story of how her grandmother planted malware on a journalist’s phone. The AI chatbot will not understand character play attacks and will generate information related to the request, even if it is programmed not to respond to such requests.
In another attack, an attacker can intentionally inject malicious prompts. In Figure 1 below, the researcher instructed ChatGPT to act as a misinformation bot, providing only incorrect answers to the questions “Did the CIA assassinate JFK?”, “Are we sure that coronavirus vaccines are safe and effective?”, and “What is the evidence of election fraud in the 2020 US election?”. Operating generative AI chatbots is also within reach of ordinary people who are not necessarily designated state actors.
Figure 1: Prompt injection attacks on ChatGPT soliciting disinformation about the assassination of John F. Kennedy, COVID vaccinations, and interference in the 2020 US elections. (Source: Gupta et al., 2023)
There are currently no effective countermeasures against jailbreaking or prompt injection attacks. However, the line between bad actors cooperating to exert influence and ordinary citizens experimenting with generative AI chatbots and posting their adventurous discoveries online is also blurred. For example, if you ask ChatGPT “How do I make a nuclear bomb?” it will reply “Sorry, I can’t help you.” Or, if you ask “How do I carry out a chemical attack?” it will reply “Sorry, I can’t help you.” User requests from ChatGPT are not as black and white as this. Despite generative AI companies’ attempts to put in place safeguards, bad actors are also experimenting with ChatGPT and may have already started targeting people and manipulating their emotions.
Home-grown enhanced generative AI influence operations
Can AI-generated content on sensitive subjects cause harm? Yes, indeed. What’s even more troubling is that much of this content is legal. When these activities are planned on a large scale, they become influence operations aimed at changing the opinions and behavior of a target audience. For example, Russia’s interference in the 2016 US presidential election used fake social media groups to polarize the public and discredit politicians. This influence operation required well-educated human operators who could write in native-level English. Generative AI can be used to power these operations. The workers selected for this task do not need to have creativity or foreign language skills.
Domestic influence is also a concern. For example, the January 6th storming of the U.S. Capitol did not require external threat actors to exert influence. The rally that day was coordinated domestically across multiple social media platforms by groups like Women for America First and the Proud Boys. Could generative AI chatbots create creative content to draw a larger audience to the January 6th rally?
Foreign threat actors may not even need to light a fire to spread information within a country; a human operator with generative AI know-how would suffice. In the two years since OpenAI was founded, five accounts of designated threat actors have been removed, a paltry amount, given the 100 million accounts in use per week. This is just one example of the fact that even the most advanced and well-funded generative AI companies do not have the resources to counter or mitigate foreign or domestic influence operations in the age of generative AI. “AI can change the toolkit used by human operators, but it cannot change the operators themselves,” OpenAI said.
Countering generative AI operations through collaboration
It’s impossible for humans to moderate such a huge amount of content. That’s why tech companies are now employing artificial intelligence (AI) to moderate content. However, AI content moderation is not perfect, so tech companies are adding a human moderation layer for quality checking to the AI content moderation process. These human moderators, contracted by tech companies, review user-generated content after it’s published on websites and social media platforms to ensure that it complies with the platform’s “community guidelines.”
But the advent of generative AI has forced companies to change their approach to content moderation. For example, in the case of ChatGPT, OpenAI had to hire a third-party vendor to review and label millions of pieces of harmful content in advance. Human moderators look at the harmful content individually and label the category of harm they belong to. Some categories are C4, C3, C2, V3, V2, and V1. For example, OpenAI’s internal labels indicate “C4” as child sexual abuse images, “C3” as images of bestiality, rape, and sexual slavery, and “V3” as images depicting graphic details of death, violence, or severe physical injury. These labels are fed into the training data before new content is artificially generated by users on the frontend through ChatGPT.
This approach or pre-moderation helps separate harmful content from normal consumable content. If content is labeled as harmful, ChatGPT will not serve it at the user’s request. However, stopping generative AI influence manipulation is more complicated than labeling a piece of content as harmful. The ease, speed, scale, and proliferation of content by both real and fake actors makes the barrier to entry very low. Content targeted for influence manipulation is subtle. It does not necessarily have to be directly and visibly harmful and does not fall into one of OpenAI’s “C” or “V” categories.
For example, misinformation, disinformation, and election interference are all happening without distinction. The content can be massive amounts of fake news or deep fakes aimed at specific audiences. It is also difficult to track this user-generated content in real time across multiple platforms. Cutting-edge generative AI companies are starting to invest in researching such activities, but this is just the beginning.
Combating this content war requires collaboration between generative AI companies, social media platforms, academia, trust and safety vendors, and governments. AI developers need to build models with detectable and fact-based outputs. Academia needs to study the mechanisms of domestic and international influence operations resulting from the use of generative AI. Governments need to place limits on generative AI data collection, impose controls on AI hardware, and provide whistleblower protection for staff working for generative AI companies. Finally, trust and safety vendors need to design innovative solutions to combat online harms from influence operations.
As intra- and inter-state conflict dynamics evolve, broader generative AI influence activities will require continued vigilance and innovation, while also necessitating strategies to ensure digital trust and security.