mdlbear | Troubleshooting Sable's SSD

Early last Sunday afternoon I noticed that the battery-charge indicator had vanished from the Gnome panel on Sable, my main laptop. (That's sort of like the row of icons and such you see along the bottom of the screen on a Mac, except that I've configured it to go vertically down the left-hand edge, where it doesn't reduce the height of my browser window too much.)

Hmm, says I to myself, maybe it will come back after a reboot. So I did that, and logging in presented me with an empty screen background. ??? A little more experimentation showed that only the Gnome-2 desktop was affected; the Ubuntu one (which I detest) worked fine. So did a console terminal, and SSH. The obvious next step was to run fsck, the file-system checker (and many hackers' favorite stand-in for a certain four-letter expletive).

Well, not quite the next step. Since I figured that fixing file-system corruption might possibly make things worse, I moved over to one of my spare laptops, Raven, sat Sable on the shelf next to my desk, and logged in on Sable with SSH. Then I went to the top of my working tree and ran make status to see what needed to be checked in. I think I've mentioned MakeStuff before -- it's basically a multi-function build tool based on GNU Make, and one of the things it can do is find every git repository under the top-level directory, and do things like check its status, or pull. (Commit takes a little more thought, so you don't want to do it indiscriminately.)
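For anyone without MakeStuff, the find-every-repository idea can be sketched in plain shell. This is only an illustration and not how MakeStuff itself does it; the function name `status_all` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of MakeStuff): list every git working tree
# under a top-level directory that has uncommitted or untracked changes.
status_all() {
    top=${1:-.}
    # An ordinary working tree has a .git directory at its root;
    # -prune stops find from descending into it.
    find "$top" -type d -name .git -prune | while IFS= read -r gitdir; do
        repo=${gitdir%/.git}
        # --porcelain prints one line per changed file, nothing if clean.
        if [ -n "$(git -C "$repo" status --porcelain)" ]; then
            echo "dirty: $repo"
        fi
    done
}
```

Running `status_all ~/vv` (or whatever your top-level directory is called) would print one `dirty:` line per repository that still needs attention.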

Then I ran MakeStuff/scripts/scripts/pull-all on Raven. Done.

Well, almost. There are a few things in my home directory that aren't under my working tree, mostly Desktop, Documents, Downloads, my Firefox bookmarks, and my Gnome Panel configuration. I hauled out a USB stick and fired up tar (like zip, except that it can save everything about a file, not just what DOS knows about). The command I actually used, because I probably forgot a few things (and should have excluded a few more, like Ruby and Perl), was

    rsync -a --exclude vv --exclude ?cache --exclude ?golang . \
          nova:/vv/backups/steve\@sable

And ran straight into the fact that USB sticks are usually formatted with a FAT filesystem, and limit files to 4GB. Growf! Faced with the unappetizing prospect of shipping 17GB of backups over WiFi, I carried Sable over to my server and plugged in the ethernet cable that I leave hanging off the router for just such occasions. After that finished, I fired up Firefox, bookmarked all my tabs, and exported tabs and bookmarks to an HTML file. Should have done that before I backed up everything, but I didn't think of it.
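Had the network not been handy, one way around the FAT file-size limit would be to split the archive into pieces as it is written. A minimal sketch, assuming a tar and split that accept these options; the function names, the default chunk size, and the part-file prefix are arbitrary choices for illustration, not what I actually ran.

```shell
#!/bin/sh
# Hypothetical helpers: write a tar archive as a series of chunks that each
# fit under FAT's 4GB per-file limit, and restore from those chunks.
backup_split() {   # usage: backup_split SRC_DIR DEST_PREFIX [CHUNK_SIZE]
    src=$1; prefix=$2; chunk=${3:-3900m}
    # tar writes the archive to stdout; split cuts the stream into pieces
    # named PREFIXaa, PREFIXab, and so on.
    tar -C "$src" -cf - . | split -b "$chunk" - "$prefix"
}

restore_split() {  # usage: restore_split DEST_PREFIX TARGET_DIR
    prefix=$1; target=$2
    mkdir -p "$target"
    # The shell expands the glob in sorted order, which matches the order
    # in which split named the pieces.
    cat "$prefix"* | tar -C "$target" -xf -
}
```

Something like `backup_split "$HOME" /media/usb/sable.tar.` would then leave a row of files under 4GB on the stick (the mount point here is a placeholder).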

Finally, I was ready to run fsck and find out the bad news. I plugged in the USB stick with the Ubuntu live installer (one does not run fsck on a mounted filesystem!), brought up a terminal, and ran

    e2fsck -cfp /dev/sda5   # check for bad blocks, force, preen

(Force means to do a full check even if the disk claims it's okay; "preen" means to make all repairs that can be done without human approval.) Naturally, after turning up a few dozen bad blocks, it told me that I had to run it manually. I could have replaced the -p option with -y, to say "yes" to all requests for approval; instead I left it off and hit Enter a hundred times or so. Almost all the problems were "doubly-claimed blocks", mostly shared between some other file and the swapfile. Of course. Fsck offered to clone those blocks, and I took it up on that offer. Then ran it again to make sure it hadn't missed anything. It hadn't. But it was still broken, no doubt because of all those corrupted files.
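If you script this sort of thing, it helps to know that e2fsck reports what happened through its exit status, which the man page documents as a sum of flag values (a preen run that gives up should have the "left uncorrected" flag set). Here is a small sketch that decodes it; the function name and the wording of the messages are mine, not e2fsck's.

```shell
#!/bin/sh
# Hypothetical helper: turn an e2fsck exit status into readable text.
# The status is a bitmask, so several conditions can be set at once.
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Flag values as listed in the e2fsck(8) man page.
    for pair in "1:errors corrected" \
                "2:errors corrected, reboot needed" \
                "4:errors left uncorrected" \
                "8:operational error" \
                "16:usage or syntax error" \
                "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((status & bit)) -ne 0 ]; then
            echo "$text"
        fi
    done
}
```

After an e2fsck run, `explain_fsck_status $?` prints one line per flag that was set.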

So this morning, after a couple of searches, I installed the debsums program, which finds all of the files you've installed, and compares their checksums against the ones in the packages they came from. The following command then takes that list, and re-installs any package containing a file with a bad checksum:

    apt-get install --reinstall $(dpkg -S $(debsums -c) \
           | cut -d : -f 1 | sort -u)
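To see what the tail end of that pipeline does without touching a real system, here is a small sketch run on made-up `dpkg -S` output. The package names and paths in the sample are invented for illustration; only the `cut` and `sort -u` steps come from the command above.

```shell
#!/bin/sh
# dpkg -S prints "package: /path/to/file", one line per file.
# This helper reduces such lines to a sorted list of unique package names.
packages_from_dpkg_output() {
    cut -d : -f 1 | sort -u
}

# Invented sample of what dpkg -S might print for three damaged files.
# The second line uses an architecture-qualified name ("pkg:arch:"), as can
# appear on multiarch systems; cut keeps only the part before the first colon.
sample_dpkg_output() {
    printf '%s\n' \
        'gnome-panel: /usr/bin/gnome-panel' \
        'libgtk-3-0:amd64: /usr/lib/x86_64-linux-gnu/libgtk-3.so.0' \
        'gnome-panel: /usr/share/gnome-panel/ui/panel.ui'
}
```

Piping `sample_dpkg_output` into `packages_from_dpkg_output` collapses the three lines to two package names, which is the list that apt-get is then asked to reinstall.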

Sable now "works" again. I know at least one zip file was corrupted (it was a download, and I was able to find it again), and fsck doesn't appear to have kept a log, so broken files will keep turning up for a while. I know there aren't any bad zip files left because there's an option in unzip, -t, that compares checksums, just the way debsums does, so I could loop through all my downloads with:

    for f in *.zip; do echo -n "$f: "; unzip -tqq "$f"; echo; done

I have two remaining tasks, I think: one is to validate all of my Git working trees (worst case -- just blow them away and re-clone them), and then comes the really hard one: deciding whether I still trust Sable's SSD, or need to get a new one. And if I get a new one, how big? Sable and its 500GB drive were purchased together, used, from eBay, and brand-new 1TB SSDs are pretty cheap right now. So there's that.
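For the first of those tasks, git has its own consistency checker, `git fsck`, which verifies the objects in a repository. A rough sketch of running it over every repository under a directory follows; the function name `fsck_all` is hypothetical, and a repository that fails is a candidate for the blow-it-away-and-re-clone treatment.

```shell
#!/bin/sh
# Hypothetical helper: run git's integrity check on every repository
# under a directory and report which ones pass and which fail.
fsck_all() {
    top=${1:-.}
    find "$top" -type d -name .git -prune | while IFS= read -r gitdir; do
        repo=${gitdir%/.git}
        # git fsck exits non-zero when it finds corrupt or missing objects.
        if git -C "$repo" fsck --full >/dev/null 2>&1; then
            echo "ok: $repo"
        else
            echo "BAD: $repo"
        fi
    done
}
```

Note that this only checks what git has stored; files in the working tree that were never committed are not covered.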

Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).
Donation buttons in profile.

    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        31 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    4%
    Data Units Read:                    6,518,547 [3.33 TB]
    Data Units Written:                 13,816,473 [7.07 TB]
    Host Read Commands:                 151,041,403
    Host Write Commands:                280,254,360
    Controller Busy Time:               2,180
    Power Cycles:                       1,365
    Power On Hours:                     10,676
    Unsafe Shutdowns:                   19
    Media and Data Integrity Errors:    1
    Error Information Log Entries:      1
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0

siliconshaman

You know.. I think I would've gone with backing up everything unique, and then just reinstalled a fresh copy of linux after reformatting the drive. But, I'm more of a hack than an expert.

mdlbear

That's probably what I'll end up doing. In one variation or another -- first I have to decide whether I want to upgrade to a terabyte drive, and then whether I want to upgrade to Ubuntu 20.04 (from 18...). Or switch distros again.

technoshaman

Get the terrorbyte drive. That way you can suck somebody else's drive down without worrying about it.

I'm thinking I need to go to the Elementary 20.04 equivalent soon...

I don't recommend it for you, b/c you like *less* GUI, not more... but I'm digging the 5.0.4 kernel... that with the intel microcode package makes spectre-meltdown-checker turn all green, and that makes me HAPPY.

Yeah. So I'm thinking Samsung SSD 860 EVO 1TB -- I haven't had a bit of trouble from the 500 I bought four-and-a-half years ago. The 500 I have will probably get less flaky with a full reinstall, but I'm not going to count on it.

I'm thinking maybe MATE on either Ubuntu or Mint -- my current desktop is XMonad on Gnome Flashback, and I suspect something based on Gnome 2 will be a better fit.

azurelunatic

"Preen" makes me think of some sort of digital phoenix with storage for feathers running its beak over its sectors and nibbling each one gently.

mdlbear

I can't decide whether I picture it as a raven, a peacock, or a parrot, but that's exactly the kind of operation I think of.

andyheninger

I'm sorry to hear that you're having disk troubles. Yet another thing to do on suspected flaky disks is to run the smartctl tool to dump the drive's built-in diagnostic info. e.g.

    sudo smartctl -a /dev/nvme0n1p1

(substitute the appropriate device name). If the drive's in serious trouble (no spare sectors, many write errors), it may be better to avoid attempting repairs, but to mount it read-only and copy off whatever you can salvage.

Sample of some of the smartctl output:

"Percentage Used" is an interesting number for SSDs. They can do only a limited number of write cycles before failing; when the percentage used reaches 100, all bets are off.

Edit: depending on the SSD brand, "Percentage Used" may be called something else, and the something else may count up or down - 100 may mean brand new or may mean worn out. With luck, the name will imply which way.

Edited Date: 2020-08-28 07:01 pm (UTC)
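For NVMe drives that report the field under exactly the name "Percentage Used", pulling the number out of the smartctl report can be sketched like this. The function name is hypothetical, and as noted above, other drives may label the figure differently, in which case this prints nothing.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# NVMe "Percentage Used" figure as a bare number (nothing if absent).
nvme_percentage_used() {
    sed -n 's/^Percentage Used:[[:space:]]*\([0-9][0-9]*\)%.*$/\1/p'
}
```

Usage would be along the lines of `sudo smartctl -a /dev/nvme0n1p1 | nvme_percentage_used`, again substituting the appropriate device name.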

Thanks -- I'd overlooked that one. I have:

    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    202 Percent_Lifetime_Used   0x0030   097   097   001    Old_age   Offline      -       3

... and 97% used looks correct given the thing's age. So it's a good thing the replacement is arriving this afternoon!

... although, the normalized values are supposedly scaled so that 100% is good, so who knows?

"... although, the normalized values are supposedly scaled so that 100% is good, so who knows?"

Yea, who knows. Searching for information on your particular series of SSD might turn up something. Either way, if the drive hasn't been consuming its spare sectors, and hasn't logged errors, it's probably not responsible for the file system corruption.

Which leaves the question of what is responsible. Linux as a whole and its file systems are usually rock solid. Maybe run memtest overnight (from a thumb drive). stress-ng could also be interesting.

My best guess is a crash or unexpected shutdown in the middle of a write. I've noticed some occasional weirdness plugging or unplugging the power cable, which I think I had just done around that time. Or I could blame it on one of the cats.

The Mandelbear's Musings
