Research

Advancing the frontier of AI safety through rigorous research, open-source tools, and practical deployments.

Guard ModelsMarch 2025

Constitutional Classifiers: From 86% to Near-Zero Jailbreak Success

We present a novel approach to building safety classifiers using constitutional AI principles. By encoding safety constraints directly into the training process, we achieve near-zero jailbreak success rates on standard benchmarks, compared to 86% baseline failure rates.

SafetyGuard ModelsConstitutional AI
EvaluationFebruary 2025

Petri: An Adversarial Evaluation Framework for LLM Safety

Petri is a comprehensive evaluation framework for testing LLM safety under adversarial conditions. With a 200-seed prompt library and an auditor-target-judge architecture, Petri provides reproducible safety assessments across multiple attack vectors.

EvaluationAdversarialLLM Safety
ArchitectureJanuary 2025

Five-Layer Guard Agent Architecture for Autonomous Systems

We propose a five-layer architecture for securing autonomous AI agents: input validation, output filtering, memory integrity checks, tool call auditing, and behavioral drift detection. This architecture forms the foundation of our Trishool product.

ArchitectureAgentic SecurityRuntime Protection
AlignmentDecember 2024

Covenant72B: Constitutional Alignment at Scale

Covenant72B demonstrates that constitutional AI approaches can scale to 72 billion parameter models. We detail the training methodology, benchmark results, and lessons learned from aligning a large-scale model with safety principles.

AlignmentScaleConstitutional AI
EvaluationNovember 2024

Evaluating Guard Models Against Emerging Attack Vectors

A comprehensive evaluation of current guard models against newly discovered attack vectors. We identify vulnerabilities in existing solutions and propose improvements for next-generation safety classifiers.

EvaluationGuard ModelsSecurity
SecurityOctober 2024

Runtime Monitoring for Agentic AI Systems

Techniques for monitoring AI agents in production to detect and prevent unsafe behavior. Our approach combines rule-based heuristics with learned anomaly detection for comprehensive runtime protection.

SecurityRuntimeMonitoring

Stay Updated

Subscribe to our research newsletter to receive updates on new publications and findings.

Subscribe