Suspicious NFS kernel module crash

I routinely mount an NFS volume from my storage server, keymaster. When I’m away from home, at work or elsewhere, I do it over a VPN (I use OpenVPN).

Today I was trying to save a file to my NFS-mounted share, and wasn’t able to. Trying to list the contents of the mounted share returned an error. I logged in to the server and discovered the following in /var/log/messages:


Apr 29 17:40:18 keymaster kernel: [179629.329499] ------------[ cut here ]------------
Apr 29 17:40:18 keymaster kernel: [179629.329520] WARNING: at /build/buildd-linux-2.6_2.6.32-31-i386-qYaaJr/linux-2.6-2.6.32/debian/build/source_i386_xen/fs/buffer.c:1160 mark_buffer_dirty+0x20/0x7a()
Apr 29 17:40:18 keymaster kernel: [179629.329530] Hardware name:
Apr 29 17:40:18 keymaster kernel: [179629.329535] Modules linked in: ...
Apr 29 17:40:18 keymaster kernel: [179629.329784] Pid: 2408, comm: nfsd Not tainted 2.6.32-5-xen-686 #1
Apr 29 17:40:18 keymaster kernel: [179629.329791] Call Trace:
Apr 29 17:40:18 keymaster kernel: [179629.329808]  [<c1037799>] ? warn_slowpath_common+0x5e/0x8a
Apr 29 17:40:18 keymaster kernel: [179629.329820]  [<c10377cf>] ? warn_slowpath_null+0xa/0xc
Apr 29 17:40:18 keymaster kernel: [179629.329832]  [<c10d68cc>] ? mark_buffer_dirty+0x20/0x7a
...
Apr 29 17:40:18 keymaster kernel: [179629.330370]  [<f8fe6ec2>] ? svc_process+0x3be/0x5b8 [sunrpc]
Apr 29 17:40:18 keymaster kernel: [179629.330391]  [<f9157754>] ? nfsd+0xd3/0x112 [nfsd]
Apr 29 17:40:18 keymaster kernel: [179629.330410]  [<f9157681>] ? nfsd+0x0/0x112 [nfsd]
Apr 29 17:40:18 keymaster kernel: [179629.330423]  [<c104b348>] ? kthread+0x61/0x66
Apr 29 17:40:18 keymaster kernel: [179629.330433]  [<c104b2e7>] ? kthread+0x0/0x66
Apr 29 17:40:18 keymaster kernel: [179629.330446]  [<c1009b07>] ? kernel_thread_helper+0x7/0x10
Apr 29 17:40:18 keymaster kernel: [179629.330455] ---[ end trace e2458c053130f111 ]---
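As an aside, on a busy box a trace like this is easy to miss among the other messages. Here is a minimal sketch for pulling such blocks out of a log, assuming they are always bracketed by the "[ cut here ]" and "[ end trace" markers seen above. The function name extract_kernel_traces is made up for illustration, and the default path is simply the file I happened to look at:

```shell
#!/bin/sh
# Print every kernel warning block found in a syslog-style file.
# A block is assumed to start at a line containing "[ cut here ]" and to end
# at a line containing "[ end trace ", as in the excerpt above.
# extract_kernel_traces is a hypothetical helper name, not a standard tool.
extract_kernel_traces() {
    awk 'index($0, "[ cut here ]") { inblock = 1 }   # opening marker: start printing
         inblock                   { print }         # print while inside a block
         index($0, "[ end trace ") { inblock = 0 }   # closing marker: stop after it
        ' "${1:-/var/log/messages}"
}
```

Running extract_kernel_traces /var/log/messages should then print only the lines from each opening marker through the matching end marker, and nothing at all if the log contains no such block.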

Now that looks like the NFS kernel module has bitten the dust. A quick Google search suggested running a memory test, and that’s what I did. Since I had recently installed Debian on keymaster, I did:

root@keymaster:~# aptitude install memtest86

Nicely enough, it re-configured grub automatically, so I just rebooted, and chose the memtest86 boot entry.

23% into the test I saw Unexpected Interrupt, Halting CPU0, and a register dump:

[Screenshot: memtest86 on broken RAM]

So, if you see a more than suspicious crash, be sure to check your RAM before digging deeper into it.

Update:

My first thought was that it was a memory problem. I bought myself another 1GB of RAM. But when I put it into the server, and booted into the memtest, good heavens! Same thing!

Now the only thing left, I thought, was to replace the motherboard. Heck, I had hacked it in so many ways that it would be no wonder if some memory lines were shorted or God knows what. I found another el-cheapo temporary box (I want to buy something real some time, but since I got rid of my Netfinity 6000R rack beast, I’m more reluctant to do so). It had a dual-core 2.8 GHz Pentium 4 and a nice quiet desktop case, with SATA already on board, so I didn’t have to wreak havoc (SATA PCI controller manual expansion: fail) to find a spare socket.

Much to my amusement, on the new hardware both memory modules failed the test. Now I was speechless. Was my methodology wrong, or do I just EMP to dust everything I touch? After thinking about it for a while, I decided that the only weak link in the chain of deduction was memtest itself.

I discovered that apart from memtest86 there is also memtest86+. I decided to give it a try, and surprisingly it ran fine on both memory modules and on both boxes… I was able to return the 1GB of RAM to the store, but I kept the new box. It has been running in good shape so far. Maybe there was some hardware fault in the previous server after all, but my guess is that this was a software-triggered (unnecessary) hardware replacement.