I routinely mount an NFS volume from my storage server, keymaster. I do it over VPN (I use openvpn) when I’m away from home, like at work or elsewhere.
Today I was trying to save a file to my NFS-mounted share, and wasn’t able to. Trying to list the contents of the mounted share returned an error. I logged in to the server and discovered the following in /var/log/messages:
Apr 29 17:40:18 keymaster kernel: [179629.329499] ------------[ cut here ]------------
Apr 29 17:40:18 keymaster kernel: [179629.329520] WARNING: at /build/buildd-linux-2.6_2.6.32-31-i386-qYaaJr/linux-2.6-2.6.32
Apr 29 17:40:18 keymaster kernel: [179629.329530] Hardware name:
Apr 29 17:40:18 keymaster kernel: [179629.329535] Modules linked in: ...
Apr 29 17:40:18 keymaster kernel: [179629.329784] Pid: 2408, comm: nfsd Not tainted 2.6.32-5-xen-686 #1
Apr 29 17:40:18 keymaster kernel: [179629.329791] Call Trace:
Apr 29 17:40:18 keymaster kernel: [179629.329808] [<c1037799>] ? warn_slowpath_common+0x5e/0x8a
Apr 29 17:40:18 keymaster kernel: [179629.329820] [<c10377cf>] ? warn_slowpath_null+0xa/0xc
Apr 29 17:40:18 keymaster kernel: [179629.329832] [<c10d68cc>] ? mark_buffer_dirty+0x20/0x7a
Apr 29 17:40:18 keymaster kernel: [179629.330370] [<f8fe6ec2>] ? svc_process+0x3be/0x5b8 [sunrpc]
Apr 29 17:40:18 keymaster kernel: [179629.330391] [<f9157754>] ? nfsd+0xd3/0x112 [nfsd]
Apr 29 17:40:18 keymaster kernel: [179629.330410] [<f9157681>] ? nfsd+0x0/0x112 [nfsd]
Apr 29 17:40:18 keymaster kernel: [179629.330423] [<c104b348>] ? kthread+0x61/0x66
Apr 29 17:40:18 keymaster kernel: [179629.330433] [<c104b2e7>] ? kthread+0x0/0x66
Apr 29 17:40:18 keymaster kernel: [179629.330446] [<c1009b07>] ? kernel_thread_helper+0x7/0x10
Apr 29 17:40:18 keymaster kernel: [179629.330455] ---[ end trace e2458c053130f111 ]---
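In case it helps anyone hitting similar symptoms, checks along these lines can confirm the problem is on the server side before you start digging through logs (a sketch; keymaster is just my server’s hostname):

```shell
# From the client: does the NFS server answer at all?
showmount -e keymaster   # list exports; hangs or errors if nfsd is wedged
rpcinfo -p keymaster     # verify mountd/nfs are registered with the portmapper

# On the server: look for kernel warnings from the nfsd threads
dmesg | grep -i nfsd
```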
Now that looks like the nfs kernel module has bitten the dust. A quick Google search suggested running a memory test, so that’s what I did. Since I had recently installed Debian on keymaster, I simply did:
root@keymaster:~# aptitude install memtest86
Nicely enough, it reconfigured GRUB automatically, so I just rebooted and chose the memtest86 boot entry.
About 23% into the test I saw “Unexpected Interrupt - Halting CPU0” and a register dump.
So, if you see a crash this suspicious, be sure to check your RAM before digging deeper into it.
My first thought was that it was a memory problem, so I bought myself another 1GB of RAM. But when I put it into the server and booted into memtest, good heavens: the same thing!
Now the only thing left, I thought, was to replace the motherboard. Heck, I had hacked it in so many ways that it would be no wonder if some memory lines were shorted, or God knows what. I found some other el-cheapo temporary hardware (I want to buy something real some time, but since I got rid of my Netfinity 6000R rack beast, I’m more reluctant to do so). It had a dual-core 2.8 GHz Pentium 4 and a nice quiet desktop case, with SATA already on board, so I didn’t have to wreak havoc (my SATA PCI controller expansion attempt was a fail) looking for a spare socket.
Much to my amusement, on the new hardware both memory sticks failed the test. Now I was speechless. Was my methodology wrong, or do I just EMP everything I touch to dust? After thinking about it for a while, I decided that the only weak link left in the chain of deduction was memtest itself.
I discovered that apart from memtest86 there is also memtest86+. I gave it a try, and surprisingly it ran fine with both memory sticks, on both boxes… I was able to return the 1GB of RAM to the store, but I kept the new box. It has been running in good shape so far, so I guess there really was some hardware fault in the previous server. Still, in the end this was a software-triggered (and unnecessary) hardware replacement.
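For the record, on Debian the alternative tester installs the same way as memtest86 did (a sketch, assuming GRUB is your bootloader; the package name memtest86+ is real, the rest is the usual Debian routine):

```shell
# Install memtest86+ alongside (or instead of) memtest86
aptitude install memtest86+
# The package hooks into GRUB; regenerate the config if it didn't happen automatically
update-grub
```

After a reboot, pick the memtest86+ entry from the GRUB menu and let it run at least one full pass.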