
Unresponsive Supermicro X10SDV-TLN4F motherboard

Why does a previously working Supermicro X10SDV-TLN4F motherboard suddenly stop booting or responding to the keyboard?

Background

Back in September 2015 I replaced my home server with a Supermicro SuperServer 5028D-TN4T Xeon D-1540. This is an amazing little box:

  • tiny enclosure;
  • Supermicro X10SDV-TLN4F motherboard with single XEON D-1540 CPU and integrated IPMI and KVM;
  • room for four hot-swappable 3.5 inch drives, plus space for another internally.

I configured this server with 16GB of ECC RAM, four 3TB SAS drives and an internal SAS SSD, and run SmartOS from a USB Flash drive. The hard drives are configured as two ZFS mirrors providing 6TB of storage, and the SSD is used as a ZFS cache. It’s blindingly fast and suits my needs perfectly.
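The storage figure follows from how a pool of striped mirrors works: each mirror holds one drive's worth of data, and the pool adds the mirrors together. A minimal sketch of that arithmetic, assuming two-way mirrors and ignoring filesystem overhead (the function name is only for illustration, and the size is in whatever unit the drives are quoted in):

```python
def striped_mirror_capacity(drive_size, drives, mirror_width=2):
    """Raw usable capacity of a pool built from striped mirrors.

    Each mirror stores one drive's worth of data however many drives
    it contains; the pool stripes across all of the mirrors.
    """
    if drives % mirror_width != 0:
        raise ValueError("drive count must be a multiple of the mirror width")
    mirrors = drives // mirror_width
    return mirrors * drive_size


# Four drives of size 3 in two two-way mirrors.
print(striped_mirror_capacity(drive_size=3, drives=4))  # prints 6
```

Real usable space will come out a little lower once metadata and reserved space are taken off.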

This server has been running untouched for nn years and, apart from a couple of reboots when I upgraded SmartOS, has been sitting peacefully, and fairly quietly, in my garage-cum-workshop. However, I noticed that the front cover, which is also the air inlet, was collecting sawdust, so I decided to power down the server and give it a good internal clean.

It’s not coming back

I cleaned the server out, plugged in the power and waited, and waited, and waited for the server to come back online. Previously, a reboot took 2-3 minutes. This time, nothing.
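Rather than watching the clock, the wait can be scripted. This is only a sketch, assuming the server runs a service that accepts TCP connections once it is up (SSH on port 22 is a common choice); the host name in the comment is a placeholder:

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Poll a TCP port until it accepts a connection or the timeout expires.

    Returns True as soon as a connection succeeds, False if the
    deadline passes without one.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or unreachable: the service is not up yet, so try again shortly.
            time.sleep(interval)
    return False


# Example (placeholder host): wait_for_port("server.example", 22, timeout=300)
```

A five-minute timeout leaves a comfortable margin over a normal 2-3 minute reboot, so a False result is a reasonable cue to go and attach a screen.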

I attached a screen and keyboard and could immediately see that the server had not booted: it had tried to boot across the network and finally given up. It was now waiting for a boot disk to be added and a key to be pressed.

To be sure of what was happening, I reset the server and watched the console. Sure enough, the server never tried to boot from the USB Flash device and went directly to PXE boot, which failed. It was also ignoring the USB-connected keyboard. One more reset confirmed that, although there was a brief flash of the keyboard LED, nothing worked after that. No NumLock, nothing.

Debugging

Stage one

I fired up the built-in KVM and, unexpectedly, found that this also ignored the keyboard: even the virtual keyboard. I could control the power, reset etc. but not interact with the console.
That meant I couldn’t see any POST messages. All I could do was a POST Snoop, and that came back with 00: i.e. all OK. No errors from the onboard USB controller, so why no USB connectivity? Strange.
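The power state and the hardware logs can also be queried from another machine over the network. A rough sketch, assuming ipmitool is installed there, the BMC is reachable over the LAN, and the password is held in the IPMI_PASSWORD environment variable; the address and user name are placeholders, and the helper names are made up for this example:

```python
import subprocess


def ipmi_command(host, user, *args):
    """Build an ipmitool command line for a BMC reachable over the LAN.

    -I lanplus selects IPMI v2.0 over LAN; -E tells ipmitool to read
    the password from the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]


def run_ipmi(host, user, *args):
    """Run the command and return its text output (needs ipmitool installed)."""
    result = subprocess.run(ipmi_command(host, user, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    HOST, USER = "192.0.2.10", "ADMIN"  # placeholder address and user
    queries = (
        ("chassis", "power", "status"),  # is the box powered on?
        ("sel", "elist"),                # system event log
        ("sdr", "list"),                 # sensor readings
    )
    for query in queries:
        # Only print the command lines here; call run_ipmi() to execute them.
        print(" ".join(ipmi_command(HOST, USER, *query)))
```

This covers only the generic IPMI queries. The POST Snoop code is a vendor feature of the BMC's own interface, so it is not included here.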

Stage two

To try and get control back, I decided to start by resetting the onboard CMOS. This is a bit fiddly on this board when it’s installed in the case, but I managed it.

No change.

Stage three

Check all plugs and connections in case I jogged something when I cleaned the server out.
I had to take the server out of its cupboard for this and put it on the bench.

That seemed to work.
This time, the server booted cleanly and came online. Put the case back on, and the server back in the cupboard.

Once again, no boot action.

Stage four

That seems to point at a physical problem. Time to remove and reseat all the boards and memory.

No change.

Time to dig deeper.

Stage five

There’s nothing showing up in the IPMI. However, the BIOS is very old, only 1.0a, and I know from this page on tinkertry.com that the current BIOS is 1.1. The catch is that if I can’t boot, I can’t update the BIOS.

Next step is to see what happens if I remove the USB Flash Media and pull all the drives.

That made a difference. I now have control via the KVM and can get into the BIOS. Initially, the finger points at the USB Flash drive.
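The approach here is plain elimination: strip out everything removable, confirm the fault goes away, then add devices back one at a time until it returns. A minimal sketch of that loop; the device names and the `simulated_boot` check are hypothetical stand-ins for physically plugging things in and watching the console:

```python
def find_culprit(devices, boots_ok):
    """Re-add devices one at a time; return the first one that breaks the boot.

    boots_ok(attached) must report whether the machine boots with the
    given list of devices attached. Returns None if nothing breaks it.
    """
    if not boots_ok([]):
        raise RuntimeError("fault persists with everything removed")
    attached = []
    for device in devices:
        attached.append(device)
        if not boots_ok(attached):
            return device
    return None


# Hypothetical stand-in for powering on and checking the KVM by hand.
def simulated_boot(attached):
    return "usb-flash" not in attached


devices = ["sas-1", "sas-2", "sas-3", "sas-4", "ssd", "usb-flash"]
print(find_culprit(devices, simulated_boot))  # prints usb-flash
```

Adding devices back one at a time is slower than halving the set, but with six devices and a multi-minute boot it keeps each result unambiguous.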

Update later the same day

Well, changing the USB Flash drive seemed to do the trick, though I’m at a loss to understand why. I booted another box from the “faulty” USB drive with no problems, but as soon as I try booting this box from it, it fails. I guess it’s a tolerance issue or something.

I still don’t understand how a Flash disk failure can cause the iKVM to fail though. However, I don’t have the time to pursue this further right now. I’m just happy that the server is up again.

Here’s to another two years of uninterrupted service before it needs to be powered down.
