I’ve been waiting for a while to get a new desktop computer at home. Back in 2006, I was biding my time to replace my dead desktop until Dell officially announced its Windows Vista “buy your computer now and we’ll give you Vista whenever it comes out” program, and I finally bought my OptiPlex 745 Small Form Factor PC (with a nearly top-of-the-line Conroe Core 2 Duo E6600 CPU) in October. Since then, I’ve been running that computer nearly 24 hours a day, seven days a week, and it’s held up remarkably well.
While I ran Windows XP full time when I got it, I switched to Ubuntu full time (aside from playing through Portal when it came out) once I got my hands on Vista. That’s not meant as a condemnation of Vista, which everyone is so quick to offer; it just worked out that way. Since I was going to do a fresh install on a fresh hard drive for Vista, it was a perfect opportunity to install Ubuntu alongside. And once Ubuntu was installed, I wanted to keep using it (as I had been full time in the months between my old desktop dying and purchasing that new computer). I preferred the desktop environment and window management, and most of the apps were either the exact same ones I already used on Windows (e.g. Opera, Firefox, Pidgin) or had suitable (if not preferable) replacements, with the exception of Foobar2000 and Media Player Classic (MPC). Things worked well, too!
The long wait
Fast forward to late 2011, and I started to get the upgrade itch. Part of that was inspired by the desire to play the (then-)upcoming Portal 2 game, which I was confident my hardware wouldn’t be able to handle, and the other part was just my inner geek. I’d certainly gotten my money’s worth out of the system that’d passed its five-year anniversary of arriving on my desk, and technology had advanced quite a bit, mostly in terms of CPU speed and storage technology (i.e. SSDs). I knew that Intel’s roadmap had another microarchitecture—Ivy Bridge—due up around the end of 2011, or early 2012, depending on which rumors and schedule slips were in the press at a given time, so I decided I’d wait for that.
The retail availability of the first Ivy Bridge desktop CPUs came and went, and scant few desktops were announced. (My interest in building my own PCs died along with my last self-built desktop, very inconveniently timed to coincide with my last few weeks of college, hence the Dell in October 2006.) Lenovo was first to the punch with some announced ThinkCentre refreshes on Ivy Bridge, but my stellar experience with the OptiPlex 745 caused me to hold out for Dell’s OptiPlex refresh. In early-to-mid-June, that finally happened, and I ordered my own OptiPlex 9010 Small Form Factor (with i7-3770) a few days later.
The machine arrived on a Monday (a few days ahead of schedule, woo!), but I had Monday Night Dinner to attend, so I couldn’t do any more than open the box and plug in the RAM (Crucial Ballistix 2x 8GB DDR3 1600) and SSD (Intel 520-series 180 GB) I’d bought separately. I don’t think I had time to turn the machine on until Wednesday or Thursday, at which point I set about shrinking the Windows partition (which went fine) and installing Linux (Ubuntu 12.04). Unfortunately, I overcomplicated things and was a bit naive, so on my first attempt I neglected to provide an EFI partition at the beginning of the SSD, which caused me quite a bit of confusion until I realized that was necessary. Rather than trying to shift data and partitions slightly (since I assume the EFI partition is supposed to go at the beginning of the drive? Maybe that’s not true…), I just installed again from scratch, since it’s a pretty painless procedure. After that, things were peachy. The machine was fast! And I wasn’t using it enough yet to be terribly mad at Unity! But then things went south…
At some point, while I was doing an aptitude install of some packages, the machine locked up hard. The disk activity indicator never blinked in response to key presses; I couldn’t ctrl-alt-F1 to another TTY; I couldn’t Alt-SysRq-<whatever> to sync the filesystem, etc. This is clearly not what you want to happen on any system, let alone a new one. The same thing happened another time when I was using the Ubuntu Software Center to install a package. It happened again when browsing a web page. It happened three more times when I was doing an rsync over the LAN.
I ran the included Dell boot diagnostics, which turned up no errors. (I actually first tried Ubuntu’s hard-drive-installed Memtest86 instance, but that wouldn’t boot, complaining about ‘linux16’ not being a valid boot option, or something.) I later put Memtest86+ 4.20 on a USB stick and ran that, and got some memory errors, but I initially suspected an incompatibility with the new Ivy Bridge chips, so I didn’t put too much thought into it. That led me to find the Memtest86+ 5.00 Beta 1, which claims to have added support for Ivy Bridge. Hooray! Right? It did successfully detect the SPD information for my RAM, unlike 4.20, but still turned up memory errors. I took out the stick of RAM the results indicated was bad … and still got errors. OK, so maybe I took out the wrong one, so I swapped them, and … still got errors.
Much testing later, I finally determined what I believe to be some sort of perverse pattern of behavior: when run from a cold boot with Ethernet plugged in, Memtest86+ displays errors. If Ethernet is unplugged, Memtest86+ does not show any errors. When not run from a cold boot, Memtest86+ does not display errors (at least, not in the place that it consistently does during all my other tests; I haven’t, for instance, let it run for hours).
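To keep the combinations straight, here is a tiny Python sketch of that pattern as it reads from the description above. The function name `errors_expected` is purely illustrative, and the warm-boot-with-Ethernet-unplugged row is an assumption, since that combination isn’t reported separately.

```python
from itertools import product


def errors_expected(cold_boot: bool, ethernet_plugged: bool) -> bool:
    """One reading of the observed pattern: errors need BOTH conditions."""
    return cold_boot and ethernet_plugged


if __name__ == "__main__":
    # Print every boot/Ethernet combination and whether Memtest86+ errors
    # would be expected under the reading above.
    print(f"{'boot':<6}{'ethernet':<11}{'Memtest86+ errors?'}")
    for cold_boot, ethernet_plugged in product([True, False], repeat=2):
        boot = "cold" if cold_boot else "warm"
        eth = "plugged" if ethernet_plugged else "unplugged"
        verdict = "yes" if errors_expected(cold_boot, ethernet_plugged) else "no"
        print(f"{boot:<6}{eth:<11}{verdict}")
```

If that reading is right, only one of the four rows shows errors, which is what makes the behavior look like an interaction rather than plain bad RAM.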
What to do now? Some folks on the ArsTechnica #linux channel are convinced that I just got a lemon, and some lead from the Ethernet controller was laid poorly in the PCB, and is interfering with the memory controller. Based on their advice, I contacted Dell support, and they’re shipping me a new system.
I’m less optimistic that the problem I’m seeing is a single-instance hardware issue. Although I have little to base my theory on, my current musing is that this is a systemic issue, possibly with the vPro implementation or the BIOS/EFI. My primary support for this is the fact that, in Windows, I’ve seen no instability during the couple-hour-long, ~1.5 MB/s download of Portal 2, web browsing, installing Windows Updates, etc. If you’ve ever read Michael Matthew Garrett’s stuff, you might be familiar with the craziness that can exist in BIOS/EFI, and the poor state of any reasonable implementation, testing, or support outside of whatever happens to work in Windows. In Windows, of course, there are lots of Intel-written drivers loaded, an Intel vPro/AMT management toolkit that runs in the system tray to interface with the vPro junk, and who knows what other kind of voodoo going on. In Ubuntu, yes, there’s some Intel-written driver code in the kernel, but the 3.2.0-23 kernel I’m running at the time of this writing is a bit outdated relative to whatever fixes they might’ve put into the current stable 3.4.4 or 3.5-rc6. I think the amount of testing the desktop driver code receives on Linux is probably laughably small compared to their Windows code (which is probably laughably small compared to maybe someone like Nvidia). I also have no idea if there’s any Linux desktop code to handle the vPro stuff. I may be in a bit over my head.
I’m not even sure how to start debugging or diagnosing this issue, which I assume will present itself again in 3-8 days, when the replacement system arrives. Writing down what I’ve done so far, so I can link others to this when I cry to them for help, is all I can think of at the moment. I’ve not been able to turn up any other useful Google results for OptiPlex 9010 and Linux or Ubuntu or Memtest86, so hopefully this at least puts the issue on the radar, in case anyone else is running into a similar problem.
(As an afterthought, I do want to mention that the OptiPlex 9010 is actually Ubuntu Certified … for pre-install only of 11.10, anyway. Dell naturally only offers that configuration in certain regions, and I’m relatively confident that gives me approximately zero traction with them for my issue…in the US…on 12.04.)
Edit: A Few Other Notes
I was running up against my desired bedtime last night as I composed this, so I rushed a bit to publish, and have since thought of a few other details from my experience so far.
- I have not been able to reproduce a lock-up using nc -l 9999 > /dev/null or nc -l 9999 > /some/file/on/the/hard/drive with the other side of a gigabit link piping /dev/zero or /dev/urandom into netcat (using TCP or UDP). My theory is that maybe netcat is too simple an operation, so it’s not exercising the CPU or memory enough for the rest of the system to trip over it? I don’t feel especially satisfied by that explanation, though.
- I have not been able to reproduce a lock-up doing an rsync from my SSD to the spinny hard drive (both internal, SATA); I wanted to rule out disk activity as the culprit, so that was my attempt to isolate the symptom a bit further. I don’t know if it’s meaningful.
- I also got a lock-up one time immediately after I copied text from Thunderbird and was moving my mouse to paste it into Firefox. At the time, the only other active desktop process I’m aware of was Banshee playing music off a cifs mount (though it’d been doing that with the computer otherwise idle for a few hours up to that point).
- Two times I’ve experienced what seems to be a drop in all my active network connections: SSH sessions unrecoverably stop responding, music playback from a cifs mount stopped, and a Remmina rdp session unrecoverably stopped responding. The only evidence in any logs I can find that something went wrong is: CIFS VFS: Server sark has not responded in 300 seconds. Reconnecting… Update: I’m pretty sure this was caused by my computer’s IP address changing. I noticed that my computer had progressively been marching up through the DHCP lease pool, rather than sticking to its originally allocated address like it should. It’s my belief this is due to vPro/AMT trying to poke its head up and get a DHCP lease using the same MAC address. Further investigation is merited. I also don’t know if this means anything for the bigger problem of the lockups.
- Special thanks to some Ubuntu guys who have come to my aid (compliments of johnf)!
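For anyone who wants to repeat the netcat experiment from the first note above without netcat, here is a rough Python approximation. It is only a sketch, not what was actually run: the names `run_sink`, `send_zeros`, and `loopback_demo` and the 64 KiB chunk size are arbitrary choices for illustration, and it covers the TCP case only.

```python
import socket
import threading

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def run_sink(server_sock, result):
    """Accept one connection and discard the data, like `nc -l PORT > /dev/null`."""
    conn, _addr = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            total += len(data)
    result["bytes"] = total


def send_zeros(host, port, total_bytes):
    """Send total_bytes of zeros, like piping /dev/zero into netcat."""
    payload = bytes(CHUNK)  # a buffer of zero bytes
    sent = 0
    with socket.create_connection((host, port)) as sock:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return sent


def loopback_demo(total_bytes=8 * 1024 * 1024):
    """Run sink and sender in one process over 127.0.0.1 (OS-chosen port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    sink = threading.Thread(target=run_sink, args=(server, result))
    sink.start()
    sent = send_zeros("127.0.0.1", port, total_bytes)
    sink.join()
    server.close()
    return sent, result["bytes"]


if __name__ == "__main__":
    sent, received = loopback_demo()
    print(f"sent {sent} bytes, sink received {received} bytes")
```

Note that the loopback demo never touches the Ethernet controller, so it only checks that the script works. To mirror the test described above, the sink would need to run on the OptiPlex (bound to its LAN address) with `send_zeros` pointed at it from another machine on the gigabit link.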
Updated: The replacement system arrived, and it appears that maybe I’m running into a BIOS/UEFI bug?
Tags: Dell OptiPlex 9010, error, linux, ubuntu