A lot of AI censorship that OpenAI used in the past was just something that detects a keyword and maybe sentiment analysis. Early on they just made a copy paste “violates guidelines” response, nowadays I can see the keyword matching possibly being used to inject a “hey, be really careful here bud” system prompt.
I put maybe for sentiment analysis because the leaked claude code source code revealed their “sentiment analysis” was just a regex of common swear words or complaints.
A lot of AI censorship that OpenAI used in the past was just something that detects a keyword and maybe sentiment analysis. Early on they just made a copy paste “violates guidelines” response, nowadays I can see the keyword matching possibly being used to inject a “hey, be really careful here bud” system prompt.
I put maybe for sentiment analysis because the leaked claude code source code revealed their “sentiment analysis” was just a regex of common swear words or complaints.