AI Security: Navigating the Evolving Landscape with Gray Swan

The rapid advancement of AI, particularly in large language models (LLMs) and coding agents, presents a new frontier in security. While AI offers transformative potential, it also introduces unique vulnerabilities that differ significantly from traditional software. Gray Swan, a company born from extensive research at Carnegie Mellon University, is dedicated to empowering individuals and organizations to use AI safely and securely. Matt Fredrikson and Zico Kolter, co-founders of Gray Swan, discuss the evolving challenges and solutions in AI security, from automated red teaming to enterprise-grade guardrails.

The Unique Challenges of AI Security

Unlike traditional software, AI systems, especially LLMs, exhibit fundamentally different behaviors and possess inherent vulnerabilities. They can be "tricked" in ways analogous to human susceptibility, requiring a distinct security mindset. The widespread adoption of a few core models, like Codex and Claude, amplifies the impact of any discovered vulnerabilities, creating new classes of exploits.

"AI systems have inherent inherent different types of vulnerabilities," explains Zico Kolter. "They can be tricked like people get tricked sometimes, right? And so you need a different mindset about security when you're thinking about AI systems."

Gray Swan's mission is to address these challenges by understanding, testing, and mitigating the security risks associated with deploying AI. This involves treating AI models as potentially untrusted entities and focusing on the security risks they introduce, rather than solely on using AI to enhance traditional cybersecurity.

Red Teaming: Pushing the Boundaries of AI Safety

A core component of Gray Swan's approach is rigorous red teaming. This involves actively trying to break AI models to identify vulnerabilities before malicious actors can exploit them. Gray Swan operates two key initiatives in this space:

These efforts are crucial for testing base models and, increasingly, for evaluating agents built on top of them, which incorporate tool use and downstream applications.

Testing Claude, Codex, and Beyond

Gray Swan's testing methodologies are applied to various cutting-edge models. For instance, they participated in the Mythos preview, focusing on the model's robustness against indirect prompt injection. This involves assessing how well the model maintains its original objectives when exposed to untrusted content fetched from external sources.

Beyond specific model evaluations, Gray Swan assists frontier labs in testing their safeguards against various adversarial activities, including cyber misuse, and provides comprehensive safety and security assessments.

The Rise of AI Agents and the "Lethal Trifecta"

The increasing sophistication of AI agents, capable of autonomous operation and tool use, introduces new attack vectors. The "lethal trifecta," a concept coined by Simon Wilson, describes the core components of prompt injection risks:

  1. Ingesting Untrusted Data: The ability of an agent to process information from external, untrusted sources.
  2. Access to Private Information: The agent's capability to access sensitive internal data.
  3. Exfiltration Capability: The ability to send this private information to an external destination.

When these three elements converge, they create significant security risks. This is particularly relevant for agents that utilize computer use capabilities, such as controlling web browsers or executing code.

Cygnal: Enterprise-Grade Guardrails for AI

To counter these vulnerabilities, Gray Swan offers Cygnal (Signal), a specialized filter model designed to act as guardrails for AI agents. Signal sits between the user, the LLM, and any tool calls, actively monitoring for policy violations.

"The other side of what we do is exactly this defense side. And so this is a model called signal which is essentially a filter model that sits between your user the LLM the LLM and any tool calls and exactly does this level of looking for policy violations," explains Fredrikson.

Signal is a custom-trained model that works best when tailored to specific enterprise policies. This is crucial because, unlike general-purpose base models, enterprises often have unique security requirements and data access restrictions that need to be enforced.

The Trade-off Between Usability and Security

A persistent challenge in AI security is the inherent trade-off between an agent's usability and its security. Limiting an agent's capabilities to enhance security can hinder its effectiveness and utility. Conversely, granting agents broad access and functionality increases their potential attack surface.

"Our goal with signal with shade to assess these vulnerabilities with signal to protect it is to shift that point up and to the right," says Kolter, referring to the Pareto frontier of usability versus security. Gray Swan aims to improve this balance, allowing organizations to deploy powerful AI agents with a higher degree of confidence.

The Future of AI Security: Automation and Specialization

The future of AI security, according to Fredrikson and Kolter, lies in increased automation and specialization.

The Inevitable "Gray Swan" Event

The name "Gray Swan" itself reflects the company's philosophy: acknowledging unlikely but foreseeable events. The founders believe that a major, publicly reported prompt injection breach is an event that, while not guaranteed, is highly probable and can be anticipated.

"A grey swan is an unlikely event you can kind of see coming," Kolter states. "And that's kind of where we are with all this. Right. This is going to happen. We know it's coming. It's not going to shock anyone when it happens."

Gray Swan's mission is to help organizations get ahead of these predictable risks, ensuring that the transformative power of AI can be harnessed responsibly and securely.

Key Takeaways