• qqq@lemmy.world
    3 days ago

    Hm this tracks to me. I’ve wondered for a bit how they deal with caching, since yes there is a huge potential for wasted compute here, but I haven’t had the time to look into it yet. Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?

    If the first user to use the cluster after boot asks “Am I pretty?”, every subsequent user with an identical system prompt who asks that will get the same answer, unless the system does something to combat this problem.

    This is very interesting to me, because I’d think they were doing something to combat that problem if they’re actually doing something multi-tenant here.

    Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?

    Thanks for the response; it’s definitely something I’ve been trying to understand.

    Edit here, thinking a bit more,

    So the solution is the KV cache: the LLM architecture keeps a key-value store, and each time the system comes across a token it has encountered before, it outputs the cached value; if not, the token is sent to the LLM and the output gets stored in the cache, associated with the input that produced it.

    This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?
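
    A toy numpy sketch of the tension here (purely illustrative, not from any real serving stack — all names and dimensions are made up): a token’s attention *output* does depend on the tokens around it, but its K/V projections within a layer depend only on that token’s own input, and that is the part that gets cached.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8  # toy model width, hypothetical

    def attention(x, Wq, Wk, Wv):
        """Single-head causal attention over a (seq, d) matrix of embeddings."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d)
        mask = np.tril(np.ones(scores.shape, dtype=bool))  # causal mask
        scores = np.where(mask, scores, -np.inf)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    tok = rng.normal(size=d)          # the "same" token embedding
    ctx_a = rng.normal(size=(3, d))   # two different preceding contexts
    ctx_b = rng.normal(size=(3, d))

    out_a = attention(np.vstack([ctx_a, tok]), Wq, Wk, Wv)[-1]
    out_b = attention(np.vstack([ctx_b, tok]), Wq, Wk, Wv)[-1]

    # The output for the same token differs when the context differs...
    assert not np.allclose(out_a, out_b)
    # ...but its K projection is identical in both runs: K/V depend only
    # on the token's own input at that layer, which is why they cache.
    assert np.allclose((np.vstack([ctx_a, tok]) @ Wk)[-1],
                       (np.vstack([ctx_b, tok]) @ Wk)[-1])
    ```

    So the cacheable piece is the per-token K/V projections, not the context-dependent outputs.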

    • voodooattack@lemmy.world
      3 days ago

      Do you have a good source to read a bit more about the design decisions or is this just a hypothetical design you came up with and all of that architecture detail is “proprietary”?

      You’re welcome. Here’s an intro with animations: https://huggingface.co/blog/not-lain/kv-caching

      And yes, most of the tech is proprietary. From what I’ve seen, nobody in ML fully understands it, tbh. I have some prior experience from my youth tinkering with small simulators I used to write in the pre-ML era, so I kinda slid into it comfortably when I got hired to work with it.

      Wouldn’t the different sessions quickly diverge and the keys would essentially become tied to a session in practice even if they weren’t directly?

      Yeah, but the real problem is scale and collision risk at that scale. Token resolution erodes as the context gets larger, and contexts can become “samey” pretty easily for standard RLHF’d interactions.
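
      For intuition on why sessions end up keyed to their own history in practice, here’s a hedged sketch of prefix-keyed KV reuse — the function names and hashing scheme are my own invention, not necessarily how any production stack does it. The cache key covers the *entire* token-ID prefix, so two sessions share cached work only while their histories are identical, and the collision risk mentioned above is the risk of two distinct prefixes hashing to the same key.

      ```python
      import hashlib

      def prefix_key(token_ids):
          """Hash the entire token-ID prefix (scheme is illustrative only)."""
          data = b",".join(str(t).encode() for t in token_ids)
          return hashlib.sha256(data).hexdigest()

      cache = {}

      def get_or_compute_kv(token_ids, compute_fn):
          """Reuse cached KV for an exact prefix match; otherwise compute."""
          key = prefix_key(token_ids)
          if key not in cache:
              cache[key] = compute_fn(token_ids)  # the expensive forward pass
          return cache[key]

      # Two sessions with the same system prompt share cached work only up
      # to the point where their token histories diverge.
      system_prompt = [101, 7, 7]
      assert prefix_key(system_prompt + [2]) == prefix_key([101, 7, 7, 2])
      assert prefix_key(system_prompt + [2]) != prefix_key(system_prompt + [9])
      ```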

      Edit:

      This seems like an issue, no? Because the tokens are influenced by the tokens around them in the attention blocks. Without them you’d have a problem, so what exactly would be cacheable here?

      This is what they do: (from that page I linked)

      Token 1: [K1, V1] → Cache: [K1], [V1]
      Token 2: [K2, V2] → Cache: [K1, K2], [V1, V2]
      ...
      Token n: [Kn, Vn] → Cache: [K1, K2, ..., Kn], [V1, V2, ..., Vn]
      

      So the key is the token and all that preceded it. It’s a kinda weird way to do it tbh, but I guess it’s necessary because of floating point and lossy GPU precision.
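
      The listing above can be sketched as an actual decode loop. This is a minimal numpy illustration of the growing cache (toy dimensions, single head, no batching — not anyone’s production code): each step projects only the newest token and appends its K and V, instead of re-running the projections for the whole prefix.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      d = 8  # toy width
      Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

      K_cache = np.empty((0, d))  # Cache: [K1, ..., Kn], one new row per step
      V_cache = np.empty((0, d))  # Cache: [V1, ..., Vn]
      tokens = rng.normal(size=(5, d))  # stand-in embeddings for 5 tokens

      outputs = []
      for x in tokens:                      # one decode step per token
          q, k, v = x @ Wq, x @ Wk, x @ Wv  # project ONLY the new token
          K_cache = np.vstack([K_cache, k])
          V_cache = np.vstack([V_cache, v])
          scores = K_cache @ q / np.sqrt(d) # attend over every cached key
          w = np.exp(scores - scores.max())
          w /= w.sum()
          outputs.append(w @ V_cache)

      # The first step has a single cached key, so its output is just V1.
      assert np.allclose(outputs[0], tokens[0] @ Wv)
      ```

      The tradeoff is memory: the cache grows linearly with context length, which is where the erosion and eviction pressure mentioned above comes in.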