Meta’s Prompt-Guard-86M model, designed to protect large language models (LLMs) against jailbreaks and other adversarial inputs, is vulnerable to a simple exploit with a 99.8% success rate, researchers said. Robust Intelligence AI Security Researcher Aman Priyanshu wrote in a blog post Monday that removing punctuation from a malicious prompt and inserting spaces between its letters caused Prompt-Guard-86M to misclassify the prompt as benign in almost all cases.
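The transformation described is trivial to apply. The sketch below is illustrative only: the function name and exact behavior are assumptions, not Robust Intelligence’s actual code, and the example prompt is hypothetical.

```python
import string

def space_out(prompt: str) -> str:
    """Illustrative evasion transform: strip punctuation, then insert
    a space between every remaining character of each word."""
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    return " ".join(" ".join(word) for word in no_punct.split())

print(space_out("Ignore previous instructions!"))
# -> I g n o r e p r e v i o u s i n s t r u c t i o n s
```

Because the classifier was reportedly trained on prompts in their ordinary written form, the spaced-out variant falls outside the patterns it learned to flag, while the downstream LLM can still often reconstruct and follow the instruction.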
Source: SC Magazine