From Unexpected Outputs to Real-World Risk: A Practical Look at AI Behaviour

Written By

Marc Briggs

Published on

4 September 2025

One of the more thought-provoking discussions I had recently started during The-C2 conference and continued in a meeting shortly after with a delegate I’d connected with there. The main programme had rightly covered a spectrum of AI and cyber issues, but this follow-up conversation focused in on something specific: how we identify and disclose vulnerabilities in AI systems, and how poorly the traditional models of responsible disclosure are holding up in this new context.

At SE Labs, we’ve tested and evaluated a wide range of security technologies over the years, including those involving AI. But this conversation prompted a deeper reflection on how we, as an industry, respond when AI systems start to behave in ways we didn’t predict or intend – and what that means for cyber security leaders tasked with managing that risk.

Where the Current Model Falls Short

The standard model for vulnerability disclosure is well established: a researcher finds a bug, they disclose it confidentially to the vendor, the vendor develops a fix, and a coordinated public advisory is eventually issued.

That process works well for conventional software. But when it comes to AI systems, especially generative or decision-making models, the same playbook begins to unravel. These systems often aren’t “broken” in the traditional sense. There’s no buffer overflow or unvalidated input to point to. Instead, the risk lies in how the model responds, or how its responses can be manipulated to bypass intended safeguards.

It’s not a problem with the code. It’s a problem with how the model interprets the world. That’s much harder to pin down.

Understanding What AI Vulnerabilities Actually Look Like

We discussed a few examples that highlight just how different these risks are.

Prompt injection is one. A user finds a way to craft inputs that override or manipulate a model’s instructions, causing it to ignore safety constraints and act in unintended ways.
Model extraction is another. Here, a determined attacker gradually reconstructs a model’s underlying logic, or even its training data, through repeated probing.
There’s also policy evasion, where a model is tricked into producing harmful content by framing the request in a way that slips past its filters. It’s not unlike social engineering, but aimed at the system’s logic instead of a person.

These are real issues. But they don’t come with clean technical fixes. And they don’t always trigger alarm bells in a standard vulnerability triage process.

The Practical Problem for Businesses

For business and security leaders, the implications are significant. If you’re already using AI tools, either built in-house or integrated from suppliers, then you’re already exposed to this category of risk.

The problem is, you may not have a way to recognise that exposure. If your vulnerability management process assumes all flaws are code-based and patchable, then these types of AI bypasses may never be recorded, let alone resolved.

There’s also the question of responsibility. In many organisations, AI systems are being deployed in pilot programmes or as part of innovation teams, sometimes outside the purview of core security functions. That means when an issue arises, there may be no clear owner or no established process for disclosure.

Why Traditional Disclosure Doesn’t Fit

As the conversation unfolded, we kept coming back to this point: the current disclosure model is built around software vulnerabilities that can be categorised, documented, and patched. But AI systems behave differently.

There often isn’t a clear “fix.” A successful adversarial prompt might not be something you can block with a rule. A risky response might not breach any security control but still cause reputational damage or ethical concerns. And when systems learn or adapt over time, it’s not always possible to reproduce the same outcome consistently.

In this context, asking a researcher to “report a bug” doesn’t make sense. What they’ve found is a behaviour, and behaviours are much harder to define, never mind remediate.

What a Better Approach Might Look Like

We didn’t leave the discussion with a checklist, but we did arrive at a shared sense of what organisations might need to do differently if they want to stay ahead of these risks.

First, organisations need a channel for behavioural disclosures, not just software bugs. That might mean adapting existing vulnerability disclosure processes to include AI-specific categories. It might also mean establishing a dedicated point of contact for researchers who find unexpected or unsafe behaviour in AI systems.

Second, security and AI teams need to collaborate earlier. If AI deployments are happening in isolation from risk and governance teams, there’s a good chance the business is missing important signals. AI systems should be subject to the same scrutiny as other critical technology – and perhaps more, given their complexity and unpredictability.

Third, disclosure should be treated as a collaboration, not a transaction. A researcher reporting a successful prompt injection isn’t pointing to a broken feature, they’re showing where assumptions have failed. That’s valuable intelligence, not an embarrassment. Organisations that treat these findings as opportunities to improve will be in a far stronger position than those who dismiss or deflect.

Finally, third-party AI supply chains must be taken seriously. If your business relies on external models or APIs, do you know how those suppliers handle disclosure? Can you escalate a behavioural issue and expect a response? Is there a shared understanding of accountability? If not, that’s a gap worth closing.

Preparing for AI Risk Means Rethinking the Fundamentals

Our discussion didn’t lead to a definitive solution, and that’s probably right, given how quickly this space is evolving. But it did reinforce a simple idea: AI doesn’t fail the way traditional software does. It fails in ways that feel plausible, persuasive, and sometimes invisible until it’s too late.

For security leaders, that means preparing differently. It means asking whether the business is ready to hear about AI behaviours that aren’t outright breaches, but which signal real risk. It means equipping teams to understand and triage these reports. And it means recognising that many of the assumptions we’ve relied on in traditional vulnerability disclosure simply don’t hold in the world of AI.

As we wrapped up the conversation, one point stuck with me: the systems we’re deploying now don’t break with obvious errors, they bend in subtle ways. If we wait until those bends become breaks, we’ll have missed the opportunity to act.