• voodooattack@lemmy.world

    Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?

    You’re welcome. Here’s an intro with animations: https://huggingface.co/blog/not-lain/kv-caching

    And yes. Most of the tech is proprietary. From what I’ve seen, nobody in ML fully understands it tbh. I have some prior experience from my youth, tinkering with small simulators I used to write in the pre-ML era, so I kinda slid into it comfortably when I got hired to work with it.

    Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?

    Yeah, but the real problem is scale and collision risk at that scale. Token resolution erodes as the context gets larger, and keys can become “samey” pretty easily for standard RLHF’d interactions.

    Edit:

    This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?

    This is what they do: (from that page I linked)

    Token 1: [K1, V1]   Cache: [K1, V1]
    Token 2: [K2, V2]   Cache: [K1, K2], [V1, V2]
    ...
    Token n: [Kn, Vn]   Cache: [K1, K2, ..., Kn], [V1, V2, ..., Vn]
    

    So the cache holds the keys and values for the token and everything that preceded it. It’s a kinda weird way to do it tbh, but I guess it’s necessary because of floating point and lossy GPU precision.
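
    To make the append-per-token pattern from that listing concrete, here’s a toy sketch in NumPy. Everything in it (the dimension, the projection matrices, the single head) is made up for illustration; a real transformer has many layers and heads, but the cache mechanics are the same: each step appends one (K, V) pair and the new token’s query attends over the whole cached prefix.

    ```python
    # Toy single-head KV cache during autoregressive decoding.
    # All shapes/weights here are hypothetical, just to show the mechanics.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                          # made-up model/head dimension
    Wq = rng.standard_normal((d, d))
    Wk = rng.standard_normal((d, d))
    Wv = rng.standard_normal((d, d))

    k_cache, v_cache = [], []      # grows by one entry per token, as in the listing

    def decode_step(x):
        """Append K_n, V_n for the new token, then attend over [K1..Kn]."""
        k_cache.append(x @ Wk)     # Cache: [K1, ..., Kn]
        v_cache.append(x @ Wv)     # Cache: [V1, ..., Vn]
        K = np.stack(k_cache)      # (n, d) — no recompute of earlier tokens
        V = np.stack(v_cache)      # (n, d)
        q = x @ Wq
        scores = K @ q / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()               # softmax over the whole prefix
        return w @ V               # context vector for the new token

    for _ in range(4):             # "generate" 4 tokens
        out = decode_step(rng.standard_normal(d))

    print(len(k_cache))            # 4 — one cached (K, V) pair per token
    ```

    The point of the cache is visible in `decode_step`: K1..K(n-1) and V1..V(n-1) are never recomputed, only stacked, which is why caching only works if those earlier values are bit-stable.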