It's self hosting Sunday! What's up selfhosters? Infrastructure edition

tofu@lemmy.nocturnal.garden · 3 months ago

It's self hosting Sunday! What's up selfhosters? Infrastructure edition

TheRagingGeek@lemmy.world · 3 months ago

so this week I was getting ready for my workday when my Son tells me CraftyController is inaccessible, so I tried to SSH into the box that the service is pinned to… nada, dead. tried to power cycle it, nada.

now this node was a B450M-A mobo Ryzen 7 2700X platform with some hodgepodge scrap RAM I’ve had running in it(RAM birthday was 2019). I hooked it up to a mini monitor and a keyboard, but it didn’t post at all, so just a blue screen of no signal. unfortunately the B450M-A mobo didn’t feature POST debug lights, nor did it use QLED, it apparently relied on PC Speaker, and my machine wasn’t telling any tales. so since I had no real idea as to the root cause and after reseating the RAM and the GPU and fiddling with it got me nowhere, I got my partner to approve the outspend for replacement of the motherboard so that I could have actual Debug indicators.

Thursday the ROG B550-F Gaming WIFI II mobo arrived, as did the Ryzen 9 5900XT and the Nautilus 360RS cooler. I spent the evening assembling the mobo and CPU and the GPU, the RAM, and all the related wiring. figured I would do the Cooler the next day. Yesterday I got the cooler in place with some serious hardware acrobatics. I then fired it up and Yellow LED. DRAM issue, so I unseated all of the RAM, plugging in one of the hodgepodge sets(I had 4x8GB ram sticks) neither set worked, went to just trying a single stick. of the 4 sticks only 1 was able to get past the Yellow LED and into completed POST.

So the RAM was shot and I’m not going to run containers on a machine with only 8 GB of ram. so I ordered up some Vengeance LPX 2x16G sticks and they arrived this morning! I just finished slotting them and then wrestling with Gentoo’s understanding of where all the hardware was. it was a lot of fiddling with the gentoo kernel config, and installing the nvidia drivers, but after all of that was done, the system booted up successfully! I’ve now got it back in its residence connected up to the UPS power, about to shunt docker containers back to the newly improved machine with 2x the CPU capacity.

Was a wild ride, but the cool part of it was when the system shat itself it was part of a 3 node Docker Swarm and I had recently migrated to a NAS for persistence of my container data. though the other 2 nodes aren’t as overbuilt as this thing, so I did have to do some memory wrangling and disabling my lower priority services in order to restore service, but I was able to ensure all necessary services were able to run during the outage, and I got some learning in regards to a couple of the services that didn’t port as cleanly as I would’ve liked. all in all fun times in system administration! lol.

SpikesOtherDog@ani.social · 3 months ago

Wow, it’s super weird to have a system catastrophically fail and kill all the memory like that.

TheRagingGeek@lemmy.world · 3 months ago

Yeah it was pretty crazy I’ve heard ram tends to go obsolete before it dies, but I do have a potential root cause, I did notice after hooking up the new motherboard that the side and back case fans weren’t plugged in, they were routed through a cheap case rgb controller board which must’ve fallen off the tape in the back side of the case, so I’m guessing thermals took them out and 1 just happened to survive(likely the one furthest from the CPU)

SpikesOtherDog@ani.social · 3 months ago

Woof, good theory. There is (or at least used to be) a section of board next to the CPU that controls the voltage, inherently controlling the internal clock speed and the FSB. As the temps increased, resistances and compositions possibly changed, causing that area to perform out of spec, theoretically damaging the board, CPU and memory.

Just to be safe though, I would not trust that PSU either. Another idea is that the PSU was malfunctioning and sent inconsistent voltages, resulting in a multi-component failure.

If you were a business that could afford to gamble on it I’d say risk it. $50 is a low cost for peace of mind. That PSU can go to local market at $10, with a disclaimer.

TheRagingGeek@lemmy.world · 3 months ago

The power supply is a bit on the older side it’s a TX650 gold. So it will probably be the next thing my platform sees in terms of update

SpikesOtherDog@ani.social · 3 months ago

$60 seems to be the comparable price. I’d buy now before there is a run on PSUs.