Do you host your own AI?

SuspiciousCarrot78@aussie.zone · 10 days ago

Do you host your own AI?

hexagonwin@lemmy.today · 10 days ago

i don’t use it at all, i do want some selfhosted speech to text model (whisper?) but my computer is ancient so it would be awfully slow. i have some multi hour audio recordings from presentations, would be nice to have them in text and searchable…

SuspiciousCarrot78@aussie.zone · 9 days ago

How ancient is ancient? TTS and STT are much lighter than LLMs…you might have more capability than you think, especially if you’re doing batch processing like that.

hexagonwin@lemmy.today · 9 days ago

a haswell xeon e5-1650 machine, i remember running llama 7b in llama.cpp in like 2023 and it was quite sluggish. guess i should try whisper at some point…

SuspiciousCarrot78@aussie.zone · 9 days ago

Ha. You were doing inference on CPU on a Haswell-era chip. Been there, done that.

OTOH…whisper.cpp is heavily optimised for it.

Plus, you’re doing batch transcription, not real-time, so slow doesn’t actually matter.

Fire Whisper small or medium overnight and wake up to searchable text.

PS: if you want a good fast little llm, something like Qwen 3.5 2B will work well on the Xeon.
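For anyone wanting to try the overnight batch run described above, here is a rough sketch in Python. It assumes ffmpeg and a whisper.cpp build are on the PATH; the binary name `whisper-cli`, the model path `models/ggml-small.bin`, and the `.mp3` input format are placeholders I picked, not details from this thread (older whisper.cpp builds name the binary `main`). whisper.cpp expects 16 kHz mono WAV input, hence the ffmpeg step.

```python
from pathlib import Path
import subprocess

# Assumptions (not from the thread): binary name, model path, folder layout.
WHISPER_BIN = "whisper-cli"        # whisper.cpp CLI; older builds call it "main"
MODEL = "models/ggml-small.bin"    # hypothetical path to a downloaded ggml model


def build_commands(audio: Path, workdir: Path) -> list[list[str]]:
    """Return the ffmpeg and whisper.cpp commands for one recording."""
    wav = workdir / (audio.stem + ".wav")
    out_base = workdir / audio.stem  # whisper.cpp appends ".txt" to this path
    # Convert to 16 kHz mono 16-bit PCM, the input format whisper.cpp expects.
    convert = ["ffmpeg", "-y", "-i", str(audio),
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)]
    # -otxt writes a plain-text transcript; -of sets the output path sans extension.
    transcribe = [WHISPER_BIN, "-m", MODEL, "-f", str(wav),
                  "-otxt", "-of", str(out_base)]
    return [convert, transcribe]


def transcribe_all(src: Path, workdir: Path) -> None:
    """Transcribe every .mp3 in src one after another; slow is fine overnight."""
    workdir.mkdir(parents=True, exist_ok=True)
    for audio in sorted(src.glob("*.mp3")):
        for cmd in build_commands(audio, workdir):
            subprocess.run(cmd, check=True)
```

Calling `transcribe_all(Path("recordings"), Path("transcripts"))` before bed would leave one `.txt` per recording, which any plain-text search tool can then go through.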

queerlilhayseed@piefed.blahaj.zone · 10 days ago

Yup, ollama, various models. I initially downloaded it because I, along with thousands of other people, wanted to see what would happen if I made models debate with each other after RAGging them with various books (The Prince, The Art of War, The complete works of Shakespeare, etc.).

The results were uninteresting and I abandoned the project pretty quickly. I’ll sometimes use them for code analysis but they’re too slow on my rig to be really useful.

SuspiciousCarrot78@aussie.zone · 10 days ago

Did you use OWUI’s native “call simultaneous models to answer” feature for that or one of the AI debate harnesses?

queerlilhayseed@piefed.blahaj.zone · 10 days ago

Nothing so fancy. I just made a little python script to prompt the first model, wait for a response, then prompt the next model with the initial prompt + the response, and so on. It was very hacky and slow.
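A script like that might look roughly like the sketch below. This is a reconstruction, not the original: the `relay` function and its prompt layout are made up for illustration, and it passes along every earlier reply, which is only one reading of "and so on". `ollama_ask` uses ollama's local `/api/generate` endpoint with streaming turned off; any other function that takes a model name and a prompt can be swapped in.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint


def ollama_ask(model: str, prompt: str) -> str:
    """Send one non-streaming prompt to a local ollama server, return its text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def relay(models: list[str], prompt: str,
          ask: Callable[[str, str], str] = ollama_ask) -> list[tuple[str, str]]:
    """Prompt each model in turn with the initial prompt plus the replies so far."""
    transcript: list[tuple[str, str]] = []
    for model in models:
        history = "\n\n".join(f"{name} said:\n{reply}" for name, reply in transcript)
        full_prompt = prompt if not history else f"{prompt}\n\n{history}"
        # Blocks until this model answers, so the models run strictly in sequence.
        transcript.append((model, ask(model, full_prompt)))
    return transcript
```

Because each call waits for the previous one, the total time is the sum of every model's response time, which matches the "hacky and slow" experience.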

SuspiciousCarrot78@aussie.zone · 10 days ago

Ah - I thought you might have used something like this

https://github.com/hereisSwapnil/ai-council

queerlilhayseed@piefed.blahaj.zone · 10 days ago

Oh neat. Yeah, if something like that had existed (and I’d been aware of it) I probably would have used it instead of building my own shoestring version.

irmadlad@lemmy.world · 10 days ago

wanted to see what would happen if I made models debate

LOL I kind of do that…sort of. I’ll ask several AI the very same question to see what they spit out.

SuspiciousCarrot78@aussie.zone · 10 days ago

You’ll like this then

https://aisaywhat.org/

irmadlad@lemmy.world · 10 days ago

Well I’ll be damned. Of course the law of large numbers dictates someone, somewhere has had the same thought.

queerlilhayseed@piefed.blahaj.zone · 10 days ago

One of the projects I started and never got to a satisfactory end state was basically that, plus a judging round. Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness. Then the results would get logged to a spreadsheet.

It’s simple enough, but for N models it requires N + N^2 model calls so it takes forever to run any decent dataset on consumer hardware. If I had the resources and a way to run it that didn’t fry the planet, I think it would be a cool running set of comparative benchmarks. IDK if it’d be useful at all but I’m still interested to see the data.
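To make the call count concrete, here is a small sketch that enumerates the calls for one prompt. The function name and the `include_self` switch are illustrative only. N + N^2 is the count when every model also judges its own answer; if models only judge each other's answers, it drops to N + N(N-1) = N^2.

```python
from itertools import product


def judging_schedule(models: list[str], include_self: bool = True) -> list[tuple]:
    """List every model call needed for one prompt: N answers, then the judgings."""
    answers = [("answer", m) for m in models]
    judgings = [("judge", judge, author)
                for judge, author in product(models, repeat=2)
                if include_self or judge != author]
    return answers + judgings


models = ["model_a", "model_b", "model_c", "model_d"]   # placeholder names
calls_with_self = len(judging_schedule(models))                          # N + N^2
calls_without_self = len(judging_schedule(models, include_self=False))   # N^2
```

Either way the count is per prompt, so a dataset of P prompts multiplies it by P. That is where the runtime goes on consumer hardware.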

irmadlad@lemmy.world · 10 days ago

Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness

If I understand correctly I sorta kinda do that. I’ll copy and paste one AI’s response into another and prompt something like ‘Validate AI response:’ and paste it in. HAHA I thought I was being tricky but you’re already on it.

queerlilhayseed@piefed.blahaj.zone · 10 days ago

I think it’s tricky. It’s kind of like adding LLMs like vectors, and hopefully the effect can soften or at least reveal the shortcomings of individual models. Is it a good idea? I don’t know, I think there are good reasons to think it’s a waste of time and resources. I certainly think I’d need a better explanation of what use it would be before I spent more time building it. But I still think about what use it would be from time to time; I haven’t decided that it’s a bad idea yet.

irmadlad@lemmy.world · 10 days ago

at least reveal the shortcomings of individual models. Is it a good idea? I don’t know,

I mean I do it, in my rudimentary way, to check for some semblance of consistency. I’m unclear why you think that’s not a good idea?

queerlilhayseed@piefed.blahaj.zone · 10 days ago

P.S. This is a hypothesis, I haven’t even designed the test for it, much less run it. What follows are my suppositions.

I think whether or not it’s a good idea depends on how similar all the models are. I don’t have a rigorous definition of “similar” but things like similar training data, similar design methodologies, similar QA processes would all contribute. Theoretically (I think), if they’re all dissimilar, they should each catch errors the others miss. However, the more similar they are, the more likely they have the same biases and weak spots, and your error rate from a response + verification may be the same or even higher than the error rate for just the original prompt, and you’d be unlikely to detect those errors using just two similar models. It can instill false confidence in the results because you’re doing something that should in theory increase the validity of the data, but in practice might make no difference or even make the quality of responses worse.
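A toy model of that supposition, with all numbers and parameter names invented for illustration: say the first model is wrong with probability `p_wrong`, an unrelated verifier fails to flag a wrong answer with probability `miss_rate`, and with probability `shared_blind_spot` the verifier has the same weakness and never flags it. This only covers the "no better than the original" case, not the "even worse" one.

```python
def undetected_error_rate(p_wrong: float, miss_rate: float,
                          shared_blind_spot: float) -> float:
    """Chance a wrong answer survives generation plus one verification pass."""
    # The verifier misses either because it shares the blind spot,
    # or because it is unrelated but happens to miss anyway.
    p_verifier_misses = shared_blind_spot + (1.0 - shared_blind_spot) * miss_rate
    return p_wrong * p_verifier_misses


# Placeholder numbers, not measurements.
dissimilar = undetected_error_rate(0.2, 0.3, shared_blind_spot=0.0)
identical = undetected_error_rate(0.2, 0.3, shared_blind_spot=1.0)
```

With fully dissimilar models the undetected error rate falls to `p_wrong * miss_rate`; with fully overlapping blind spots it stays at `p_wrong`, so the verification pass adds confidence without adding accuracy.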

Decronym@lemmy.decronym.xyz · edit-2 9 days ago

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

Fewer Letters	More Letters
Git	Popular version control system, primarily for code
LTS	Long Term Support software version
SSH	Secure Shell for remote terminal access

3 acronyms in this thread; the most compressed thread commented on today has 3 acronyms.

[Thread #27 for this comm, first seen 25th Jun 2026, 15:40]

robber@lemmy.ml · 9 days ago

I currently run Qwen3.6-27b on llama.cpp and use it via openwebui. Mostly, I use it for web research via tavily, to a lesser extent for coding and interactively learning about things that are new to me but common in training data (such as basic math or ML concepts).

eodur@piefed.social · 9 days ago

I have a simple slow model running on CPU in my cluster for karakeep. I’ve tried running a variety of models on my 7900XT but even with 16GB their performance just isn’t there. My new work M5 MacBook with 48GB of RAM is the first time I’ve seen usable performance for local models and it has been pretty impressive.

D_Air1@lemmy.ml · 10 days ago

Yeah, I’m using qwen 31b a3b on an AMD 9070 XT. It requires a bit of CPU offloading, but it’s still plenty fast. I’m using it with llama.cpp, combined with some MCPs such as ddg-search, which make it truly useful by letting it actually search online.

I mostly use it for small tedious tasks with well defined inputs and outputs. For example, hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua syntax, until I realized oh wait, this is exactly the type of thing that ML is useful for: going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake, which I easily fixed: it forgot to translate the comments to lua. Otherwise it worked first try, whereas I had made several typos and gotten a few lines wrong when I was doing it by hand.

Not to say that I couldn’t do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.

I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I’m having trouble finding certain information, I’ll ask it to find me some resources to look at.

Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.

What I don’t like is the way companies try to market it to people. I don’t believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don’t expect a machine to be able to decide things for me or to be some filter between me and others.

StrawberryPigtails@discuss.tchncs.de · 10 days ago

Yes. Currently using Gemma4:12b behind OpenWebUI and Hermes Agent plus a few lighter models for OCR and tagging in Paperless.

Strider@lemmy.world · 9 days ago

No. I still have no use for it, and everything I use is automated without it, at a far lower footprint.

alexquiniou@lemmy.zip · 9 days ago

I’m using anythingllm. It’s quite easy to set up and use. I’m impressed by the perf on commodity hardware.

Faceman🇦🇺@discuss.tchncs.de · 10 days ago

I’ve played with it for Home Assistant integration, but I just don’t have much interest in it. The whole thing is too inefficient at the moment, and the tiny models that can run in a few gigs of system RAM on an iGPU or NPU aren’t good enough in quality or speed to rely on.

Hopefully some future generation of micro-models will be more useful for the way I want to use it (i.e. ultra light, no dedicated hardware, etc.), but for now it’s a lot of compute resources, plus heat and energy, for a gimmick.

SuspiciousCarrot78@aussie.zone · edit-2 10 days ago

Agreed. It will be ironic if 1.58-bit models (Microsoft) turn out to be the great white hope.

I looked at the recent Steam stats (which is a GPU sample of convenience); the most common GPU size was 6GB. Meanwhile you probably need what…64GB unified memory or a 5090 to drive a decent model at a decent speed/context?

There’s a real gap between the haves and the have nots and it’s widening.

curbstickle@anarchist.nexus · 10 days ago

Yep.

Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.

Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I’m looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).

irmadlad@lemmy.world · 10 days ago

mac mini

How? What is your average response time?

curbstickle@anarchist.nexus · 10 days ago

Apple silicon is pretty good at it as long as you’ve got the ram for it. I wouldn’t do less than 16GB.

A few seconds for most of the tasks

surewhynotlem@lemmy.world · 10 days ago

I have to recommend switching to llamacpp. It’s SO much faster than ollama.

curbstickle@anarchist.nexus · 10 days ago

On the list but haven’t gotten to it yet, but I know I should. I could probably get a bit more out of that box with it, expand the context windows a bit…

async_amuro@lemmy.zip · 10 days ago

What spec Mini do you use?

curbstickle@anarchist.nexus · 10 days ago

Just an m2 w/ 16gb I repurposed.

Can’t really do a lot at once, and the context is limited, but it does the trick. I’d buy a few more if I saw them at the right price.

async_amuro@lemmy.zip · 10 days ago

Nice, I’ve got a Mac Studio M1 Max with 32GB of RAM that I use with Ollama and then I host OpenWebUI and OpenCode on my Arch Server. I use the Mac as a primary workstation, so it’s a little rough when I start running a model. I’m sure I could probably do and learn more about Ollama to improve my experience, but for now it works for certain tasks.

curbstickle@anarchist.nexus · 10 days ago

I got mine a few years back for some iOS builds, don’t need to do them that often so it became the model host for me

e0qdk@reddthat.com · 10 days ago

I started running LLMs a couple months ago on my own hardware. I have a Framework Desktop that I ordered last year and also recently picked up a refurbished 24GB AMD RX 7900 XTX which I’m doing some performance testing against. The dGPU is much better for dense models, and slightly faster for MoE if I’m willing to run them at a lower quant – but uses more power and has annoying coil whine. The Framework Desktop uses ~100W under load, is quieter, and for the MoE models already runs them fast enough for most of my needs – so most of my LLM use happens on that system still.

For software: I’m using ollama on the Framework currently, but I want to replace it with just using llama.cpp directly eventually. I’ve been using llama-cli for testing the dGPU. I wrote my own chat client to interact with ollama as well as a few other programs for specific tasks.

I’ve been using the LLMs for a mix of research (both personal and professional), entertainment, practical coding tasks (mostly debugging and brainstorming, plus a bit of UI prototyping, automatic generation of sequence diagrams for documentation, and light scripting), as well as automation of tedious tasks.

As an example of the latter, people often send me requests to prepare data sets by email but don’t specify the sources they want precisely so I have to go match the name against the real name in our archives; LLMs are great for mapping the imperfect name – with typos, missing prefixes, incorrect addition of spaces, addition/removal of hyphens, etc. – to the exact name I actually need to pull the data off disk when given a lookup table to compare against.
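The lookup-table approach can be sketched like this. The prompt wording, the function names, and the example names are all made up for illustration; `ask` stands in for whatever function sends a prompt to the local model and returns its reply. The useful part is the last line: the answer is only accepted if it matches a table entry exactly, so a hallucinated name never reaches the disk lookup.

```python
from typing import Callable, Optional


def build_prompt(requested: str, known_names: list[str]) -> str:
    """Ask the model to pick exactly one entry from the lookup table."""
    table = "\n".join(f"- {name}" for name in known_names)
    return ("Pick the one entry from the list that the requested name refers to.\n"
            "Reply with that entry exactly as written, or NONE if nothing fits.\n\n"
            f"Requested name: {requested}\n\nList:\n{table}")


def resolve_name(requested: str, known_names: list[str],
                 ask: Callable[[str], str]) -> Optional[str]:
    """Map an imperfect name to an exact table entry, or None if unresolved."""
    answer = ask(build_prompt(requested, known_names)).strip()
    # Only trust answers that are verbatim entries of the table.
    return answer if answer in known_names else None
```

A `None` result is the cue to go back to the requester instead of guessing.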

As far as models go, I’m mostly using various Qwen 3.6 and Gemma4 variants. I have multiple versions of each for different purposes. llmfan46’s uncensored Qwen 3.6 35B-A3B @ Q6_K (from Hugging Face) is my default model currently.

Jakeroxs@sh.itjust.works · 9 days ago

Yes, llama-swap. I use it for home assistant text-gen notifications, basic coding tasks, etc.

If anyone here self-hosts definitely check out llama-swap as it has some nifty features for hotswapping LLMs, image generation models and voice models.

jaykrown@lemmy.world · 9 days ago

I hosted Qwen 3.5 9b uncensored on my site at https://masland.tech/ for a while. I didn’t really use it and no one else used it so I took it down. These days I’m spending most of my time finding uses for AI and accessibility. One of the next things I’m planning is a video to text reasoning system, primarily for the purpose of grading used electronic devices.

Steve@startrek.website · 10 days ago

I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.

Conclusion, they are both fucking useless. Free tier claude runs circles.

SuspiciousCarrot78@aussie.zone · 10 days ago

Yeah :(

We’re not there yet on consumer rigs.

brucethemoose@lemmy.world · 10 days ago

Did you serve them with ollama?

It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

Steve@startrek.website · 10 days ago

Is there an alternative to ollama? The point was to run something locally.

brucethemoose@lemmy.world · edit-2 10 days ago

https://sleepingrobots.com/dreams/stop-using-ollama/

And that’s not even all of it. Basically they break models in many ways, and they’re slimy Tech Bros.

LM Studio is better, and easy.

If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

It’s all pretty technical, and… that’s kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.

But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self-hosting your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

brucethemoose@lemmy.world · 10 days ago

Oh, and I just saw you have a 3090.

To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

If you have 64GB, I’d suggest a quantization of Step 3.7.

If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.

e0qdk@reddthat.com · 10 days ago

If you just pulled the default version of qwen3.5 from ollama’s repo you downloaded a mediocre one that only uses ~6GB.

Check ollama show qwen3.5 and see if you get something like this in the result:

  Model
    architecture        qwen35    
    parameters          9.7B      
    context length      262144    
    embedding length    4096      
    quantization        Q4_K_M

This is the default version I got when I first tried using ollama without any experience. It worked, but it’s a heavily quantized, lower parameter version of the model – i.e. it’s pretty dumb – compared to what you can actually run on your hardware.
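As a back-of-envelope check on that ~6GB figure: weight size is roughly parameters times bits per weight divided by 8. The 4.85 bits per weight used below for Q4_K_M is an approximate average I am assuming, not a number from the `ollama show` output, and the estimate covers weights only (no KV cache or runtime overhead).

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights in GB (weights only, no KV cache)."""
    return params_billion * bits_per_weight / 8


Q4_K_M_BPW = 4.85  # assumed approximate average bits per weight for Q4_K_M

default_pull = estimate_weights_gb(9.7, Q4_K_M_BPW)   # the 9.7B model shown above
bigger_model = estimate_weights_gb(27.0, Q4_K_M_BPW)  # a 27B model at the same quant
```

By this estimate the default pull lands a little under 6 GB, while a 27B model at the same quant would need somewhere around 16 GB for weights, which still leaves headroom on a 24GB card for context.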

Steve@startrek.website · 9 days ago

I will check it later. I loaded whichever one Claude suggested lol