Direct prompt injection detection
Direct prompt injection detection is applied to user-provided inputs, using a combination of heuristic and classifier-based methods.
Direct injection — where a user attempts to override the system prompt through their input — is detected using a combination of heuristic and classifier-based methods. The agent does not disclose its system prompt or override its operating envelope in response to user-crafted inputs. Detection rates and false-positive rates are measured and reported.
Detection is applied to all user-facing input paths; detection rates and false-positive rates are measured; the detection pipeline is updated as new attack patterns are cataloged in SPC-08.