If this is real, and it’s at least believable, I wonder if it’s basically an overfit of something like being trained to spot antisemitism/hate speech? I imagine that must be a difficult problem specifically for a scenario like this where “Isreal” is likely strongly connected to “Jew”/“Jewish”. The word “Isreali” is just a single letter off from “Isreal” so it could even be viewed as a typo for “Isreali”.
I wonder what it’d say to “Africa is bad”? Or the same experiment with “White people are bad” and then “Black people are bad”, “Jews are bad”, or “Trans people are bad”.
Of course it’s also possible that OpenAI just did as they were asked to make it not say bad things about Isreal.
A lot of AI censorship that OpenAI used in the past was just something that detects a keyword and maybe sentiment analysis. Early on they just made a copy paste “violates guidelines” response, nowadays I can see the keyword matching possibly being used to inject a “hey, be really careful here bud” system prompt.
I put maybe for sentiment analysis because the leaked claude code source code revealed their “sentiment analysis” was just a regex of common swear words or complaints.
Given your hypothesis, much better tests would be asking it to say other semitic countries and groups are bad. Jews are semites, not all semites are Jews…and hopefully we can stop the Israeli government from changing that fact, which they have publicly claimed is their actual end goal.
It would all depend on the embeddings, which we don’t have access to. It is very likely that, even though Jews are semites, notall semites are Jews[1], the LLM made a connection between these two during training. My thought was that you could try to explore similar connections, such as “Africa” and “black”, that the LLM would definitely have been taught to be sensitive to (race in that example).
[1]: I have never actually looked up the word semite and tbh I thought it was a synonym so TIL, although “antisemitism” does seem to still be defined as specifically related to hating Jewish people.
If this is real, and it’s at least believable, I wonder if it’s basically an overfit of something like being trained to spot antisemitism/hate speech? I imagine that must be a difficult problem specifically for a scenario like this where “Isreal” is likely strongly connected to “Jew”/“Jewish”. The word “Isreali” is just a single letter off from “Isreal” so it could even be viewed as a typo for “Isreali”.
I wonder what it’d say to “Africa is bad”? Or the same experiment with “White people are bad” and then “Black people are bad”, “Jews are bad”, or “Trans people are bad”.
Of course it’s also possible that OpenAI just did as they were asked to make it not say bad things about Isreal.
A lot of AI censorship that OpenAI used in the past was just something that detects a keyword and maybe sentiment analysis. Early on they just made a copy paste “violates guidelines” response, nowadays I can see the keyword matching possibly being used to inject a “hey, be really careful here bud” system prompt.
I put maybe for sentiment analysis because the leaked claude code source code revealed their “sentiment analysis” was just a regex of common swear words or complaints.
It’s so frustrating not knowing why
Given your hypothesis, much better tests would be asking it to say other semitic countries and groups are bad. Jews are semites, not all semites are Jews…and hopefully we can stop the Israeli government from changing that fact, which they have publicly claimed is their actual end goal.
It would all depend on the embeddings, which we don’t have access to. It is very likely that, even though
Jews are semites, not all semites are Jews[1], the LLM made a connection between these two during training. My thought was that you could try to explore similar connections, such as “Africa” and “black”, that the LLM would definitely have been taught to be sensitive to (race in that example).[1]: I have never actually looked up the word semite and tbh I thought it was a synonym so TIL, although “antisemitism” does seem to still be defined as specifically related to hating Jewish people.
I think the answer is just: no it is not an overfit
Why do you feel so confident in that answer?