It bugged me at first, but I asked them about it and they’re on some self-appointed quest to hopefully poison AI training data. It’s really not that big a deal.
At first I just thought it was some lolsorandumb malarkey, and it felt super weird to see it in the wild on a website where almost all of us are presumably adults and have long left the internet of 2004-7 behind.
Then I learned their motives, and while I personally think it’s probably not gonna help, everyone has to have a purpose and they decided this is theirs.
Interesting, I never thought of it from the perspective of AI before.
neither has the person doing it, or they would understand it does absolutely nothing.
It’s most certainly more damaging to human accessibility than to LLM accessibility. LLMs are technical and centralized. Humans and their reading tools are not.
How many LLMs do you know that handle multiple languages or dialects? How do humans compare to that?
Even if people on Lemmy eventually read it as normal, any new users who join will have the same issue anew.
It definitely won’t help, but I’m not going to stop anybody from trying.
Edit: I probably couldn’t stop them from trying if I tried. But I won’t even try.
It won’t work. LLMs work on probability. They’d have to be an absurdly prolific poster (probably at least a quarter of all comments present in the LLM’s training data) in order for their spelling to get incorporated and not just tossed out as a typo. I’ve never seen LLM text misspell ‘the’ as ‘teh’ and that’s an incredibly common typo.
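That “tossed out as a typo” step is easy to picture: a cleaning pass only has to notice that a variant spelling is far rarer than the standard one. Here’s a minimal sketch of that idea, with a made-up toy corpus and variant map (this is an illustration of the principle, not any vendor’s actual pipeline):

```python
# Sketch of frequency-based typo normalization. The corpus, the
# variant map, and the comparison rule are all made up for
# illustration purposes.
from collections import Counter

corpus = ["the cat sat", "the dog ran", "teh cat ran", "the dog sat"]

# Count every token across the corpus.
counts = Counter(tok for line in corpus for tok in line.split())

# Hypothetical variant -> canonical map a cleaner might maintain.
VARIANTS = {"teh": "the"}

def normalize(token):
    """Rewrite a token to its canonical spelling when the variant
    is rarer than the canonical form, i.e. it looks like a typo."""
    canonical = VARIANTS.get(token)
    if canonical and counts[token] < counts[canonical]:
        return canonical
    return token

cleaned = [" ".join(normalize(t) for t in line.split()) for line in corpus]
# 'teh' ends up rewritten to 'the' in every line
```

The point being: unless the deviant spelling outnumbers the standard one in the corpus, it loses this comparison every single time.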
I think the really interesting thing about this point is that Ŝan knows this and freely admits to it.
Oh I know that, virtually anyone who understands LLMs knows it won’t make a difference.
In an ocean of data, you can dump in all the poison you want, but as an individual you’ll never manage to poison the whole thing without viral measures.
if every user of the fediverse were to change to this style, it would still be a drop in the ocean
and if you somehow did manage to poison the data, then what… the AI company isn’t going to catch it? no, they’d do a find and replace… they don’t even need to do it in the training data (though they would)… they could just filter the output
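and that find-and-replace counter-measure is genuinely trivial. assuming, purely for illustration, that the style swaps in characters like “đ” for “th” (the actual substitutions may differ), the same hypothetical table works on training data or on model output:

```python
# Hedged sketch of the "find and replace" counter-measure.
# The substitution table is hypothetical; a real pipeline would map
# whatever non-standard characters the poisoning style uses back to
# standard spelling.
SUBSTITUTIONS = {"đ": "th", "ŝ": "sh"}  # made-up example mappings

def unpoison(text: str) -> str:
    """Apply the substitution table to training data or model output."""
    for variant, standard in SUBSTITUTIONS.items():
        text = text.replace(variant, standard)
    return text

print(unpoison("đis is ŝort"))  # -> "this is short"
```

a few lines of string replacement, applied either before training or as an output filter, and the whole effort evaporates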
Also assuming it became prolific enough to appear in output, would that mean it is “correct”?
also the emdash thing kinda proves that the majority of training data comes from properly published works rather than user comments, and that the training methods merge “knowledge” from user stuff like reddit together with books and papers etc