Do you host your own ML / AI / LLM? What do you use, and what do you use it for?
i don’t use it at all, i do want some selfhosted speech to text model (whisper?) but my computer is ancient so it would be awfully slow. i have some multi hour audio recordings from presentations, would be nice to have them in text and searchable…
How ancient is ancient? TTS and STT is much lighter than llm…you might have more capability than you think, especially if you’re doing batch processing like that.
a haswell xeon e5-1650 machine, i remember running llama 7b in llama.cpp in like 2023 and it was quite sluggish. guess i should try whisper at some point…
Ha. You were doing inference on CPU on a haswell era. Been there, done that.
OTOH…whisper.cpp is heavily optimised for it.
Plus, you’re doing batch transcription, not real-time, so slow doesn’t actually matter.
Fire Whisper small or medium overnight and wake up to searchable text.
PS: if you want a good fast little llm, something like Qwen 3.5 2B will work well on the Xeon.
Yup, ollama, various models. I initially downloaded it because I, along with thousands of other people, wanted to see what would happen if I made models debate with each other after RAGging them with various books (The Prince, The Art of War, The complete works of Shakespeare, etc.).
The results were uninteresting and I abandoned the project pretty quickly. I’ll sometimes use them for code analysis but they’re too slow on my rig to be really useful.
Did you use OWUIs native “call simultaneous models to answer” feature for that or one of the AI debate harnesses?
Nothing so fancy. I just made a little python script to prompt the first model, wait for a response, then prompt the next model with the initial prompt + the response, and so on. It was very hacky and slow.
Ah - I thought you might have used something like this
Oh neat. Yeah, if something like that had existed (and I’d been aware of it) I probably would have used it instead of building my own shoestring version.
wanted to see what would happen if I made models debate
LOL I kind of do that…sort of. I’ll ask several AI the very same question to see what they spit out.
You’ll like this then
Well I’ll be damned. Of course the law of large numbers dictates someone, somewhere has the same thought.
One of the projects I started and never got to a satisfactory end state was basically that, plus a judging round. Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness. Then the results would get logged to a spreadsheet.
It’s simple enough, but for N models it requires N + N^2 model calls so it takes forever to run any decent dataset on consumer hardware. If I had the resources and a way to run it that didn’t fry the planet, I think it would be a cool running set of comparative benchmarks. IDK if it’d be useful at all but I’m still interested to see the data.
Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness
If I understand correctly I sorta kinda do that. I’ll copy and paste one AI’s response into another and prompt something like 'Validate AI response: and paste it in. HAHA I thought I was being tricky but you’re already on it.
I think it’s tricky. It’s kind of like adding LLMs like vectors, and hopefully the effect can soften or at least reveal the shortcomings of individual models. Is it a good idea? I don’t know, I think there are good reasons to think it’s a waste of time and resources. I certainly think I’d need a better explanation of what use it would be before I spent more time building it. But I still think about what use it would be from time to time; I haven’t decided that it’s a bad idea yet.
at least reveal the shortcomings of individual models. Is it a good idea? I don’t know,
I mean I do it, in my rudimentary way, to check for some semblance of consistency. I’m unclear why you think that not a good idea?
P.S. This is a hypothesis, I haven’t even designed the test for it, much less run it. What follow are my suppositions.
I think whether or not it’s a good idea depends on how similar all the models are. I don’t have a rigorous definition of “similar” but things like similar training data, similar design methodologies, similar QA processes would all contribute. Theoretically (I think), if they’re all dissimilar, they should each catch errors the others miss. However, the more similar they are, the more likely they have the same biases and weak spots, and your error rate from a response + verification may be the same or even higher than the error rate for just the original prompt, and you’d be unlikely to detect those errors using just two similar models. It can instill false confidence in the results because you’re doing something that should in theory increase the validity of the data, but in practice might make no difference or even make the quality of responses worse.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:
Fewer Letters More Letters Git Popular version control system, primarily for code LTS Long Term Support software version SSH Secure Shell for remote terminal access
3 acronyms in this thread; the most compressed thread commented on today has 3 acronyms.
[Thread #27 for this comm, first seen 25th Jun 2026, 15:40] [FAQ] [Full list] [Contact] [Source code]
I currently run Qwen3.6-27b on llama.cpp and use it via openwebui. Mostly, I use it for web research via tavily, to a lesser extent for coding and interactively learning about things that are new to me but common in training data (such as basic math or ML concepts).
I have a simple slow model running on CPU in my cluster for karakeep. I’ve tried running a variety of models on my 7900XT but even with 16GB their performance just isn’t there. My new work m5 Mac book with 48GB of ram is the first time I’ve seen usable performance for local models and it has been pretty impressive.
Yeah, I’m using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp’s such as ddg-search to make it truly useful by actually being able to search online.
I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.
Not to say that I couldn’t do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.
I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I’m having trouble finding certain information. I’ll ask it to find me some resources to look at.
Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.
What I don’t like is the way companies try to market it to people. I don’t believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don’t expect a machine to be able to decide things for me or to be some filter between me and others.
Yes. Currently using Gemma4:12b behind OpenWebUI and Hermes Agent plus a few lighter models for OCR and tagging in Paperless.
No. I still have no use for it and everything I use is automated without at a far lower footprint.
I’m using anythingllm. It’s quite easy to setup and use. I’m impressed of the perf on comodity hardware.
I’ve played with it for Home Assistant integration, but I just dont have much interest in it, the whole thing is too inefficient at the moment, and the tiny models that can run in a few gigs of system ram on an ipgu or npu arent good enough in quality or speed to rely on.
Hopefully some future generation micro-models will be more useful for the way I want to use it (aka , ultra light, no dedicated hardware etc.), but for now it’s a lot of compute resources, plus heat and energy for a gimmick.
Agreed. It will be ironic if 1.58B models (Microsoft) turns out to be the great white hope.
I looked at the recent Steam stats (which is a GPU sample of convenience); the most common GPU size was 6GB. Meanwhile you probably need what…64GB unified memory or a 5090 to drive a decent model at a decent speed/context?
There’s a real gap between the haves and the have nots and it’s widening.
Yep.
Ollama + about 8 different models at the moment, hosted on a mac mini with open webui as a front end.
Predominantly for transcription, translation, an extra round of security checks on code, a more context friendly home assistant interface, and a daily run of context evaluation on property I’m looking for with a lot of specific needs (acreage, min elevation change, soil type, area, etc).
mac mini
How? What is your average response time?
Apple silicon is pretty good at it as long as you’ve got the ram for it. I wouldn’t do less than 16GB.
A few seconds for most of the tasks
I have to recommend switching to llamacpp. It’s SO much faster than ollama.
On the list but haven’t gotten to it yet, but I know I should. I could probably get a bit more out of that box with it, expand the context windows a bit…
What spec Mini do you use?
Just an m2 w/ 16gb I repurposed.
Can’t really do a lot at once, and the context is limited, but it does the trick. I’d buy a few more if I saw them at the right price.
Nice, I’ve got a Mac Studio M1 Max with 32GB of RAM that I use with Ollama and then I host OpenWebUI and OpenCode on my Arch Server. I use the Mac as a primary workstation, so it’s a little rough when I start running a model. I’m sure I could probably do and learn more about Ollama to improve my experience, but for now it works for certain tasks.
I got mine a few years back for some iOS builds, don’t need to do them that often so it became the model host for me
I started running LLMs a couple months ago on my own hardware. I have a Framework Desktop that I ordered last year and also recently picked up a refurbished 24GB AMD RX 7900 XTX which I’m doing some performance testing against. The dGPU is much better for dense models, and slightly faster for MoE if I’m willing to run them at a lower quant – but uses more power and has annoying coil whine. The Framework Desktop uses ~100W under load, is quieter, and for the MoE models already runs them fast enough for most of my needs – so most of my LLM use happens on that system still.
For software: I’m using ollama on the Framework currently, but I want to replace it with just using llama.cpp directly eventually. I’ve been using llama-cli for testing the dGPU. I wrote my own chat client to interact with ollama as well as a few other programs for specific tasks.
I’ve been using the LLMs for a mix of research (both personal and professional), entertainment, practical coding tasks (mostly debugging and brainstorming, plus a bit of UI prototyping, automatic generation of sequence diagrams for documentation, and light scripting), as well as automation of tedious tasks.
As an example of the latter, people often send me requests to prepare data sets by email but don’t specify the sources they want precisely so I have to go match the name against the real name in our archives; LLMs are great for mapping the imperfect name – with typos, missing prefixes, incorrect addition of spaces, addition/removal of hyphens, etc. – to the exact name I actually need to pull the data off disk when given a lookup table to compare against.
As far as models go, I’m mostly using various Qwen 3.6 and Gemma4 variants. I have multiple versions of each for different purposes. llmfan46’s uncensored Qwen 3.6 35B-A3B @ Q6_K (from Hugging Face) is my default model currently.
Yes, llama-swap and I use it for home assistant text-gen notifications, basic coding tasks, etc
If anyone here self-hosts definitely check out llama-swap as it has some nifty features for hotswapping LLMs, image generation models and voice models.
I hosted Qwen 3.5 9b uncensored on my site at https://masland.tech/ for a while. I didn’t really use it and no one else used it so I took it down. These days I’m spending most of my time finding uses for AI and accessibility. One of the next things I’m planning is a video to text reasoning system, primarily for the purpose of grading used electronic devices.
I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.
Conclusion, they are both fucking useless. Free tier claude runs circles.
Yeah :(
Were not there yet on consumer rigs.
Did you serve them with ollama?
It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.
Is there an alternative to ollama? The point was to run something locally.
https://sleepingrobots.com/dreams/stop-using-ollama/
And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.
LM Studio is better, and easy.
If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).
It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.
But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.
Oh, and I just saw you have a 3090.
To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main
Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF
If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.
If you have 64GB, I’d suggest a quantization of Step 3.7.
If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.
If you just pulled the default version of qwen3.5 from ollama’s repo you downloaded a mediocre one that only uses ~6GB.
Check
ollama show qwen3.5and see if you get something like this in the result:Model architecture qwen35 parameters 9.7B context length 262144 embedding length 4096 quantization Q4_K_MThis is the default version I got when I first tried using ollama without any experience. It worked, but it’s a heavily quantized, lower parameter version of the model – i.e. it’s pretty dumb – compared to what you can actually run on your hardware.
I will check it later. I loaded whichever one cluade suggested lol







