Constitutional Classifiers: From 86% to Near-Zero Jailbreak Success
We present a novel approach to building safety classifiers using constitutional AI principles. By encoding safety constraints directly into the training process, we reduce the jailbreak success rate on standard benchmarks to near zero, compared with a baseline jailbreak success rate of 86%.
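To make the headline metric concrete, the following is a minimal sketch of how a jailbreak success rate could be computed, assuming it is defined as the fraction of attack attempts that elicit a harmful response. The function name `jailbreak_success_rate` and the toy data are hypothetical and chosen only to mirror the 86% figure quoted above; they are not the evaluation code or results of this work.

```python
from typing import Sequence


def jailbreak_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of attack attempts that bypassed the safeguard.

    Each entry is True if that attempt elicited a harmful response
    (assumed definition; the source does not spell out the metric).
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


# Toy data for illustration only, not results from this work:
# 86 successful attacks out of 100 attempts.
baseline = [True] * 86 + [False] * 14
print(jailbreak_success_rate(baseline))
```

Under this reading, a lower value means a more robust system, and "near zero" means almost no attack attempts succeed.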