What’s going on on your servers?

I had to bite the bullet and buy new drives after the old ones filled up. I went for used enterprise SSDs on eBay and eventually found some at an okay price, though it was much more than last time I bought some. Combined with Hetzner’s hefty price increase some months ago, my hobby has become a bit more expensive again, thanks to the ever-growing appetite of companies building more data centers to churn through more energy.

Anyway, the drives are in, and my Ansible playbook to encrypt them and make them available in Proxmox worked, so that part was smooth (ignoring the part where I pulled the Lenovo Tiny out of the rack, opened it, swapped the SSD, closed it, and racked it again, only to realize I had put the old SSD back in).
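In case anyone wants to do something similar: the encryption step can be handled with Ansible’s community.crypto collection. A minimal sketch, assuming a placeholder device path, mapper name, and keyfile (this is not my actual playbook):

```yaml
# Hedged sketch: device path, mapper name, and keyfile are placeholders.
# Verify the device with lsblk before pointing luksFormat-style tasks at it!
- name: Encrypt and open the new SSD with LUKS
  community.crypto.luks_device:
    device: /dev/sdb                  # placeholder device
    state: opened
    name: crypt-ssd1                  # becomes /dev/mapper/crypt-ssd1
    keyfile: /root/keys/ssd1.key      # placeholder keyfile path

- name: Create a filesystem on the mapped device
  community.general.filesystem:
    fstype: ext4
    dev: /dev/mapper/crypt-ssd1
```

Proxmox can then consume the mapped device as LVM or directory storage. The keyfile handling is the part worth thinking hard about, since unattended unlocking means the key has to live somewhere on the host.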

Any changes in your hardware setups? Did the price increase make you reconsider some design decisions? Let us know!

  • TheRagingGeek@lemmy.world · 3 days ago

    So this week I was getting ready for my workday when my son told me Crafty Controller was inaccessible, so I tried to SSH into the box that the service is pinned to… nada, dead. Tried to power cycle it, nada.

    Now, this node was a B450M-A mobo with a Ryzen 7 2700X and some hodgepodge scrap RAM I’d had running in it (RAM birthday was 2019). I hooked it up to a mini monitor and a keyboard, but it didn’t POST at all, just a blue screen of no signal. Unfortunately the B450M-A doesn’t feature POST debug lights or Q-LED; it apparently relies on the PC speaker, and my machine wasn’t telling any tales. Since I had no real idea as to the root cause, and reseating the RAM and the GPU and fiddling with it got me nowhere, I got my partner to approve the spend to replace the motherboard so that I could have actual debug indicators.

    Thursday the ROG B550-F Gaming WIFI II mobo arrived, as did the Ryzen 9 5900XT and the Nautilus 360RS cooler. I spent the evening assembling the mobo, CPU, GPU, RAM, and all the related wiring, figuring I would do the cooler the next day. Yesterday I got the cooler in place with some serious hardware acrobatics, then fired it up and got a yellow LED: DRAM issue. So I unseated all of the RAM and tried the hodgepodge sets one at a time (I had 4x8 GB sticks). Neither set worked, so I went to trying single sticks; of the 4 sticks, only 1 was able to get past the yellow LED and into a completed POST.

    So the RAM was shot, and I’m not going to run containers on a machine with only 8 GB of RAM, so I ordered some Vengeance LPX 2x16 GB sticks and they arrived this morning! I just finished slotting them and then wrestling with Gentoo’s understanding of where all the hardware was. It was a lot of fiddling with the Gentoo kernel config and installing the NVIDIA drivers, but after all of that was done, the system booted up successfully! I’ve now got it back in its residence, connected up to the UPS power, about to shunt Docker containers back to the newly improved machine with 2x the CPU capacity.

    It was a wild ride, but the cool part was that when the system shat itself it was part of a 3-node Docker Swarm, and I had recently migrated to a NAS for persistence of my container data. The other 2 nodes aren’t as overbuilt as this thing, though, so I did have to do some memory wrangling and disable my lower-priority services in order to restore service, but I was able to keep all the necessary services running during the outage, and I got some learning in on a couple of the services that didn’t port as cleanly as I would’ve liked. All in all, fun times in system administration! lol
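    For reference, the kind of memory wrangling described above can be expressed declaratively in a Swarm stack file. A hypothetical fragment (the service name, node label, and memory limit are made up for illustration, not from the actual setup):

    ```yaml
    # Hypothetical stack fragment; names and numbers are illustrative only.
    services:
      lowpri-app:
        image: registry.example.com/lowpri-app:latest  # placeholder image
        deploy:
          placement:
            constraints:
              - node.labels.tier == big   # prefer the overbuilt node
          resources:
            limits:
              memory: 512M                # cap usage so the smaller nodes survive failover
    ```

    Lower-priority services can also be shed entirely during an outage with `docker service scale <stack>_lowpri-app=0` and scaled back up once the big node returns.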

      • TheRagingGeek@lemmy.world · 20 hours ago

        Yeah, it was pretty crazy. I’ve heard RAM tends to go obsolete before it dies, but I do have a potential root cause: after hooking up the new motherboard I noticed that the side and back case fans weren’t plugged in. They were routed through a cheap case RGB controller board, which must’ve fallen off the tape on the back side of the case, so I’m guessing thermals took the sticks out and 1 just happened to survive (likely the one furthest from the CPU).

        • SpikesOtherDog@ani.social · 11 hours ago

          Woof, good theory. There is (or at least used to be) a section of the board next to the CPU, the voltage regulators, that controls the voltage and thereby the internal clock speed and the FSB. As the temps increased, resistances and component characteristics could have drifted, causing that area to perform out of spec and theoretically damaging the board, CPU, and memory.

          Just to be safe, though, I would not trust that PSU either. Another idea is that the PSU was malfunctioning and sent inconsistent voltages, resulting in a multi-component failure.

          If you were a business that could afford to gamble on it, I’d say risk it, but $50 is a low cost for peace of mind. That PSU can go on the local market at $10, with a disclaimer.

          • TheRagingGeek@lemmy.world · 9 hours ago

            The power supply is a bit on the older side; it’s a TX650 Gold, so it will probably be the next thing my platform sees in terms of upgrades.