mdlbear: (technonerdmonster)
[personal profile] mdlbear

Early last Sunday afternoon I noticed that the battery-charge indicator had vanished from the Gnome panel on Sable, my main laptop. (That's sort of like the row of icons and such you see along the bottom of the screen on a Mac, except that I've configured it to go vertically down the left-hand edge, where it doesn't reduce the height of my browser window too much.)

Hmm, says I to myself, maybe it will come back after a reboot. So I did that, and logging in presented me with an empty screen background. ??? A little more experimentation showed that only the Gnome-2 desktop was affected; the Ubuntu one (which I detest) worked fine. So did a console terminal, and SSH. The obvious next step was to run fsck, the file-system checker (and many hackers' favorite stand-in for a certain four-letter expletive).

Well, not quite the next step. Since I figured that fixing file-system corruption might possibly make things worse, I moved over to one of my spare laptops, Raven, sat Sable on the shelf next to my desk, and logged in on Sable with SSH. Then I went to the top of my working tree and ran make status to see what needed to be checked in. I think I've mentioned MakeStuff before -- it's basically a multi-function build tool based on GNU Make, and one of the things it can do is find every git repository under the top-level directory, and do things like check its status, or pull. (Commit takes a little more thought, so you don't want to do it indiscriminately.)
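For anyone without MakeStuff, the "find every git repository under a directory and check its status" idea can be roughed out in plain shell. This is only a sketch and not how MakeStuff actually does it; the function names `find_repos` and `status_all` are made up for illustration, and it only notices repositories whose `.git` is a directory (so not submodules or linked worktrees).

```shell
#!/bin/sh
# Hypothetical helpers; NOT MakeStuff's actual implementation.

# Print every directory under $1 (default: .) that contains a .git directory.
# -prune keeps find from descending into the .git directories themselves.
find_repos() {
    find "${1:-.}" -type d -name .git -prune 2>/dev/null |
        sed 's|/\.git$||' | sort
}

# Show a short status for each repository that has uncommitted changes.
status_all() {
    find_repos "${1:-.}" | while IFS= read -r repo; do
        changes=$(git -C "$repo" status --short 2>/dev/null)
        if [ -n "$changes" ]; then
            printf '== %s\n%s\n' "$repo" "$changes"
        fi
    done
}

# Usage (from the top of the working tree):
#   status_all .
```

Repositories with nothing to report stay silent, so the output is just the list of things that still need a commit.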

Then I ran MakeStuff/scripts/scripts/pull-all on Raven. Done.

Well, almost. There are a few things in my home directory that aren't under my working tree, mostly Desktop, Documents, Downloads, my Firefox bookmarks, and my Gnome Panel configuration. I hauled out a USB stick, fired up tar (like zip, except that it can save everything about a file, not just what DOS knows about). The command I actually used, because I probably forgot a few things (and should have excluded a few more, like Ruby and Perl), was

    rsync -a --exclude vv --exclude ?cache --exclude ?golang . \
          nova:/vv/backups/steve\@sable

And ran straight into the fact that USB sticks are usually formatted with a FAT filesystem, which limits files to 4GB. Growf! Faced with the unappetizing prospect of shipping 17GB of backups over WiFi, I carried Sable over to my server and plugged in the ethernet cable that I leave hanging off the router for just such occasions. After that finished, I fired up Firefox, bookmarked all my tabs, and exported tabs and bookmarks to an HTML file. Should have done that before I backed up everything, but I didn't think of it.
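The FAT limit is easy to check for before copying anything. Here is a minimal sketch, assuming GNU or BSD `find`; the function name `too_big_for_fat` is made up for illustration, and the default limit is the FAT32 maximum file size of 4 GiB minus one byte.

```shell
#!/bin/sh
# Hypothetical helper: list files under $1 (default: .) larger than $2 bytes.
# The default of 4294967295 bytes is the largest file FAT32 can hold.
too_big_for_fat() {
    find "${1:-.}" -type f -size +"${2:-4294967295}"c
}

# Usage:
#   too_big_for_fat ~/Downloads
```

Note that a single archive counts as one file, so a multi-gigabyte tarball hits the limit even when none of the files inside it would.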

Finally, I was ready to run fsck and find out the bad news. I plugged in the USB stick with the Ubuntu live installer (one does not run fsck on a mounted filesystem!), brought up a terminal, and ran

    e2fsck -cfp /dev/sda5 # check for bad blocks, force, preen

(Force means to do a full check even if the disk claims it's okay; "preen" means to make all repairs that can be done without human approval.) Naturally, after turning up a few dozen bad blocks, it told me that I had to run it manually. I could have replaced the -p option with -y, to say "yes" to all requests for approval; instead I left it off and hit Enter a hundred times or so. Almost all the problems were "doubly-claimed blocks", mostly shared between some other file and the swapfile. Of course. Fsck offered to clone those blocks, and I took it up on that offer. Then ran it again to make sure it hadn't missed anything. It hadn't. But it was still broken, no doubt because of all those corrupted files.
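When e2fsck is run with -p, its exit status says whether it gave up and wants a manual run. The status is a bit mask documented in the e2fsck man page; the sketch below decodes it. The function name `explain_fsck_status` is made up for illustration, and the messages are paraphrased rather than e2fsck's own wording.

```shell
#!/bin/sh
# Hypothetical helper: decode an e2fsck exit status (a bit mask).
explain_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ]   && echo "errors corrected"
    [ $((status & 2)) -ne 0 ]   && echo "errors corrected, reboot needed"
    [ $((status & 4)) -ne 0 ]   && echo "errors left uncorrected"
    [ $((status & 8)) -ne 0 ]   && echo "operational error"
    [ $((status & 16)) -ne 0 ]  && echo "usage or syntax error"
    [ $((status & 32)) -ne 0 ]  && echo "canceled by user"
    [ $((status & 128)) -ne 0 ] && echo "shared library error"
    return 0
}

# Usage (from the live installer, filesystem unmounted):
#   e2fsck -fp /dev/sda5; explain_fsck_status $?
```

A status with the 4 bit set after a preen run is the "run it manually" case described above.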

So this morning, after a couple of searches, I installed the debsums program, which finds all of the files you've installed, and compares their checksums against the ones in the packages they came from. The following command then takes that list, and re-installs any package containing a file with a bad checksum:

    apt-get install --reinstall $(dpkg -S $(debsums -c) \
           | cut -d : -f 1 | sort -u)
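The middle of that pipeline, turning `dpkg -S` output into a list of package names, has a couple of corner cases: a file shared by several packages is reported as `pkg1, pkg2: /path`, and diverted files produce extra `diversion by ...` lines. A slightly more defensive version of that step might look like the sketch below. The function name `packages_from_dpkg_search` is made up, and the output formats it handles are my understanding of what `dpkg -S` prints, so treat them as assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read `dpkg -S` output on stdin, print unique package names.
# Assumes lines look like "pkg: /path", "pkg:arch: /path" or "pkg1, pkg2: /path",
# and that diversion notices start with "diversion by ".
packages_from_dpkg_search() {
    grep -v '^diversion by ' |
        sed 's/: .*$//' |          # keep everything before the first ": "
        tr ',' '\n' |              # one package per line
        sed 's/^ *//; s/ *$//' |   # trim the space left after each comma
        grep -v '^$' |
        sort -u
}

# Usage:
#   dpkg -S $(debsums -c) | packages_from_dpkg_search
```

Unlike `cut -d : -f 1`, this keeps an architecture suffix such as `libc6:amd64`, which apt-get also accepts as a package name.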

Sable now "works" again. I know at least one zip file was corrupted (it was a download, and I was able to find it again), and fsck doesn't appear to have kept a log, so broken files will keep turning up for a while. I know there aren't any bad zip files left because there's an option in unzip, -t, that compares checksums, just the way debsums does, so I could loop through all my downloads with:

    for f in *.zip; do printf '%s: ' "$f"; unzip -tqq "$f"; echo; done

I have two remaining tasks, I think: one is to validate all of my Git working trees (worst case -- just blow them away and re-clone them), and then comes the really hard one: deciding whether I still trust Sable's SSD, or need to get a new one. And if I get a new one, how big? Sable and its 500GB drive were purchased together, used, from eBay, and brand-new 1TB SSDs are pretty cheap right now. So there's that.
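For the first task, `git fsck` will at least verify each repository's object store. A minimal sketch follows; the function name `fsck_repos` is made up for illustration. Keep in mind that this checks git's own objects, not the checked-out files; corrupted working-tree files show up as unexpected modifications in `git status` instead.

```shell
#!/bin/sh
# Hypothetical helper: run `git fsck --full` in each directory given as an
# argument, print the ones that fail, and return non-zero if any did.
# A directory that is not a git repository at all also counts as a failure.
fsck_repos() {
    bad=0
    for repo in "$@"; do
        if ! git -C "$repo" fsck --full >/dev/null 2>&1; then
            echo "DAMAGED: $repo"
            bad=1
        fi
    done
    return "$bad"
}

# Usage:
#   fsck_repos ~/vv/project1 ~/vv/project2    # paths here are placeholders
```

Anything it flags is a candidate for the blow-it-away-and-re-clone treatment.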

Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).
Donation buttons in profile.

Date: 2020-08-27 12:30 pm (UTC)
siliconshaman: black cat against the moon (Default)
From: [personal profile] siliconshaman

You know.. I think I would've gone with backing up everything unique, and then just reinstalled a fresh copy of Linux after reformatting the drive. But, I'm more of a hack than an expert.

Date: 2020-08-27 07:27 pm (UTC)
technoshaman: Tux (Default)
From: [personal profile] technoshaman
Get the terrorbyte drive. That way you can suck somebody else's drive down without worrying about it.

I'm thinking I need to go to the Elementary 20.04 equivalent soon...

I don't recommend it for you, b/c you like *less* GUI, not more... but I'm digging the 5.0.4 kernel... that with the intel microcode package makes spectre-meltdown-checker turn all green, and that makes me HAPPY.

Date: 2020-08-27 09:56 pm (UTC)
azurelunatic: Teddybear that contains ethernet switch.  (teddyborg)
From: [personal profile] azurelunatic
"Preen" makes me think of some sort of digital phoenix with storage for feathers running its beak over its sectors and nibbling each one gently.

smartctl diagnostics

Date: 2020-08-28 05:58 pm (UTC)
From: [personal profile] andyheninger
I'm sorry to hear that you're having disk troubles. Yet another thing to do on suspected flaky disks is to run the smartctl tool to dump the drive's built-in diagnostic info. e.g.

sudo smartctl -a /dev/nvme0n1p1

(substitute the appropriate device name). If the drive's in serious trouble (no spare sectors, many write errors), it may be better to avoid attempting repairs, but to mount it read-only and copy off whatever you can salvage.

Sample of some of the smartctl output:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    6,518,547 [3.33 TB]
Data Units Written:                 13,816,473 [7.07 TB]
Host Read Commands:                 151,041,403
Host Write Commands:                280,254,360
Controller Busy Time:               2,180
Power Cycles:                       1,365
Power On Hours:                     10,676
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    1
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0


"Percentage Used" is an interesting number for SSDs. They can do only a limited number of write cycles before failing; when the percentage used reaches 100, all bets are off.

Edit: depending on the SSD brand, "Percentage Used" may be called something else, and the something else may count up or down - 100 may mean brand new or may mean worn out. With luck, the name will imply which way.
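If you want to keep an eye on one of these numbers over time, it's easy to pull a single field out of the report. A rough sketch, assuming the `Name:   value` layout shown in the sample above; the function name `smart_field` is made up, and the field names vary by drive, as noted.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` output on stdin and print the value
# of the field whose name (the text before the colon) exactly matches $1.
smart_field() {
    awk -F: -v name="$1" '$1 == name { sub(/^[ \t]+/, "", $2); print $2 }'
}

# Usage:
#   sudo smartctl -a /dev/nvme0n1p1 | smart_field "Percentage Used"
```

The exact match on the name keeps "Available Spare" from also picking up "Available Spare Threshold".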
Edited Date: 2020-08-28 07:01 pm (UTC)

Date: 2020-08-29 04:52 pm (UTC)
From: [personal profile] andyheninger
although, the normalized values are supposedly scaled so that 100% is good, so who knows?

Yea, who knows. Searching for information on your particular series of SSD might turn up something. Either way, if the drive hasn't been consuming its spare sectors, and hasn't logged errors, it's probably not responsible for the file system corruption.

Which leaves the question of what is responsible. Linux as a whole and its file systems are usually rock solid. Maybe run memtest overnight (from a thumb drive). stress-ng could also be interesting.
