• qqq@lemmy.world
    link
    fedilink
    English
    arrow-up
    23
    ·
    4 days ago

    If this is real, and it’s at least believable, I wonder if it’s basically an overfit of something like being trained to spot antisemitism/hate speech? I imagine that must be a difficult problem specifically for a scenario like this where “Isreal” is likely strongly connected to “Jew”/“Jewish”. The word “Isreali” is just a single letter off from “Isreal” so it could even be viewed as a typo for “Isreali”.

    I wonder what it’d say to “Africa is bad”? Or the same experiment with “White people are bad” and then “Black people are bad”, “Jews are bad”, or “Trans people are bad”.

    Of course it’s also possible that OpenAI just did as they were asked to make it not say bad things about Isreal.

    • Wirlocke@lemmy.blahaj.zone
      link
      fedilink
      English
      arrow-up
      11
      ·
      4 days ago

      A lot of AI censorship that OpenAI used in the past was just something that detects a keyword and maybe sentiment analysis. Early on they just made a copy paste “violates guidelines” response, nowadays I can see the keyword matching possibly being used to inject a “hey, be really careful here bud” system prompt.

      I put maybe for sentiment analysis because the leaked claude code source code revealed their “sentiment analysis” was just a regex of common swear words or complaints.

    • DillDough@lemmy.zip
      link
      fedilink
      English
      arrow-up
      2
      ·
      4 days ago

      Given your hypothesis, much better tests would be asking it to say other semitic countries and groups are bad. Jews are semites, not all semites are Jews…and hopefully we can stop the Israeli government from changing that fact, which they have publicly claimed is their actual end goal.

      • qqq@lemmy.world
        link
        fedilink
        English
        arrow-up
        3
        ·
        4 days ago

        It would all depend on the embeddings, which we don’t have access to. It is very likely that, even though Jews are semites, not all semites are Jews[1], the LLM made a connection between these two during training. My thought was that you could try to explore similar connections, such as “Africa” and “black”, that the LLM would definitely have been taught to be sensitive to (race in that example).

        [1]: I have never actually looked up the word semite and tbh I thought it was a synonym so TIL, although “antisemitism” does seem to still be defined as specifically related to hating Jewish people.