I got this nice message (and some others) in my log today:
smartd: Device: /dev/sdd, 2 Currently unreadable (pending) sectors
After running smartctl -t long /dev/sdd, smartctl now reports:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 8719
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
[...]
SMART Error Log Version: 1
ATA Error Count: 10 (device log contains only the most recent five errors)
[...]
Error 10 occurred at disk power-on lifetime: 8692 hours (362 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 78 e5 89 e0 Error: UNC 8 sectors at LBA = 0x0089e578 = 9037176
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 78 e5 89 2f 00 45d+06:36:27.880 READ DMA EXT
25 00 08 70 e5 89 2f 00 45d+06:36:27.798 READ DMA EXT
25 00 08 68 e5 89 2f 00 45d+06:36:27.308 READ DMA EXT
25 00 08 e0 77 e5 35 00 45d+06:36:27.295 READ DMA EXT
25 00 08 60 e5 89 2f 00 45d+06:36:26.653 READ DMA EXT
[...]
Error 9 occurred at disk power-on lifetime: 8692 hours (362 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 60 e5 89 e0 Error: UNC 8 sectors at LBA = 0x0089e560 = 9037152
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 8715 797566305
# 2 Extended offline Completed without error 00% 8198 -
Notice that 197 Current_Pending_Sector indicates that there are 2 sectors with a problem.
Note that the SMART error log shows the LBAs as 9037176 (0x89e578) and 9037152 (0x89e560), but it's limited to 0xffffff (24 bits). The self-test log shows us the proper LBA of the error, 797566305 (0x2f89e561), which has the missing 0x2f in front, and it also shows that it's the second sector of the block of 8 we tried to read.
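To convince myself of the truncation and of the "second sector" part, here's a quick shell-arithmetic sketch; nothing here is new data, just the numbers from the logs above:

printf '0x%x\n' 797566305                # 0x2f89e561, the full LBA from the self-test log
echo $(( 0x2f89e560 & 0xffffff ))        # 9037152, i.e. the 0x89e560 the 24-bit error log showed
echo $(( 797566305 - 0x2f89e560 ))       # 1, so the bad sector is the second of the 8 read from ...e560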
The kernel log also shows:
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata4.00: (BMDMA2 stat 0xc0009)
ata4.00: cmd 25/00:08:60:e5:89/00:00:2f:00:00/e0 tag 0 cdb 0x0 data 4096 in
res 51/40:00:61:e5:89/00:00:2f:00:00/f0 Emask 0x9 (media error)
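If I decode that res line correctly, the lba_low/mid/high fields (61:e5:89) together with the hob byte (2f) spell out exactly the sector the self-test reported:

echo $(( 0x2f << 24 | 0x89 << 16 | 0xe5 << 8 | 0x61 ))   # 797566305 = 0x2f89e561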
So I've tried to reproduce the error with:
# dd if=/dev/sdd of=797566305 skip=797566305 bs=512 count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes (512 B) copied, 1.27727 s, 0.4 kB/s
Notice that it can actually read that block; it just seems to take a while.
And the error log now contains:
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 61 e5 89 e0 Error: UNC 1 sectors at LBA = 0x0089e561 = 9037153
Because it's part of a RAID device I could just let the whole disk resync, but I was a little curious which files were sitting on those sectors. So I found this document that explains how to find out which files have the problem. But it only has examples covering ext2/ext3 on a plain partition and on LVM, and nothing about RAID devices. So it looks like someone using RAID 0 will have a hard time finding the file that's having a problem.
So following the document we start with:
# fdisk -lu /dev/sdd
Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x000578f9
Device Boot Start End Blocks Id System
/dev/sdd1 * 63 385559 192748+ fd Linux raid autodetect
/dev/sdd2 972864270 976768064 1951897+ fd Linux raid autodetect
/dev/sdd3 385560 972864269 486239355 fd Linux raid autodetect
Partition table entries are not in disk order
In /proc/mdstat we see:
md1 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
1458717696 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
So sdd3 is part of md1 and I have an ext3 filesystem on that. tune2fs -l /dev/md1 | grep Block gives:
Block count: 364679424
Block size: 4096
Blocks per group: 32768
mdadm tells me:
# mdadm -QE /dev/sdd3
/dev/sdd3:
Magic : a92b4efc
Version : 00.90.00
UUID : 747b071a:a70f723b:c3a9ed44:15845cbf
Creation Time : Mon Nov 12 20:13:51 2007
Raid Level : raid5
Used Dev Size : 486239232 (463.71 GiB 497.91 GB)
Array Size : 1458717696 (1391.14 GiB 1493.73 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Update Time : Mon Nov 10 10:57:28 2008
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : 4b3147f4 - correct
Events : 78
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 51 3 active sync /dev/sdd3
0 0 8 3 0 active sync /dev/sda3
1 1 8 19 1 active sync /dev/sdb3
2 2 8 35 2 active sync /dev/sdc3
3 3 8 51 3 active sync /dev/sdd3
I've been told that the 0.90 superblock format stores the superblock at the end of the device, so the data starts right at the beginning of the partition, which makes it a little easier. And this seems to agree:
# cat /sys/block/md1/md/dev-sdd3/offset
0
We have a left-symmetric layout, so in raid5_compute_sector() we see
stripe = chunk_number / data_disks;
*dd_idx = chunk_number % data_disks;
[...]
case ALGORITHM_LEFT_SYMMETRIC:
        *pd_idx = data_disks - stripe % raid_disks;
        *dd_idx = (*pd_idx + 1 + *dd_idx) % raid_disks;
So it looks like the chunks are located on the disks like this:
stripe |  sda    |  sdb    |  sdc    |  sdd     |
     0 |  D 0    |  D 1    |  D 2    |  P 0-2   |
     1 |  D 4    |  D 5    |  P 3-5  |  D 3     |
     2 |  D 8    |  P 6-8  |  D 6    |  D 7     |
     3 |  P 9-11 |  D 9    |  D 10   |  D 11    |
     4 |  D 12   |  D 13   |  D 14   |  P 12-14 |
Each of those chunks is 64K.
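To double-check that table, here's a small bash sketch of the same mapping, derived from the two lines of raid5_compute_sector() above, assuming raid_disks=4 and data_disks=3 as in this array:

raid_disks=4; data_disks=3
for stripe in 0 1 2 3 4; do
    pd_idx=$(( data_disks - stripe % raid_disks ))      # disk holding the parity for this stripe
    row=""
    for disk in 0 1 2 3; do
        if [ "$disk" -eq "$pd_idx" ]; then
            row="$row P"
        else
            # data chunk index within the stripe, counting from the disk after the parity
            didx=$(( (disk - pd_idx - 1 + raid_disks) % raid_disks ))
            row="$row D$(( stripe * data_disks + didx ))"
        fi
    done
    echo "stripe $stripe:$row"
done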
The problem is at LBA 797566305 and the partition starts at sector 385560, so that's sector 797566305-385560 = 797180745 inside partition sdd3. Each sector is 512 bytes and a chunk is 65536 bytes, which gives us 128 sectors per chunk. To find out which stripe the sector is on we do: 797180745/128 = 6227974.5703125.
This means the stripe number is 6227974 and that our sector is number 0.5703125*128 = 73 within that chunk on sdd3.
The parity for that stripe is on disk pd_idx = 3 - 6227974 % 4 = 1, which is sdb.
To get the data chunk number inside the array we multiply the stripe number by the number of data disks (3) and add the index of our data chunk within the stripe. sdd3 is the fourth device (RaidDevice 3) and the parity for this stripe is on the second device (sdb), so counting from the disk after the parity, sdd3 holds the second data chunk of the stripe, which means we add 1: 6227974*3+1 = 18683923.
So, calculating that back to sectors, we multiply by the number of sectors per chunk again and add our offset within the chunk: 18683923*128+73 = 2391542217.
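The whole calculation again, as a bash sketch using only the values established above:

lba=797566305; part_start=385560                 # failing LBA and the start of sdd3
sectors_per_chunk=$(( 65536 / 512 ))             # 128 sectors per 64K chunk
raid_disks=4; data_disks=3
dev_idx=3                                        # sdd3 is RaidDevice 3

sector=$(( lba - part_start ))                   # 797180745, sector inside sdd3
stripe=$(( sector / sectors_per_chunk ))         # 6227974
offset=$(( sector % sectors_per_chunk ))         # 73
pd_idx=$(( data_disks - stripe % raid_disks ))   # 1, parity on sdb3
didx=$(( (dev_idx - pd_idx - 1 + raid_disks) % raid_disks ))   # 1, second data chunk of the stripe
chunk=$(( stripe * data_disks + didx ))          # 18683923
echo $(( chunk * sectors_per_chunk + offset ))   # 2391542217, the sector on /dev/md1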
To check whether our math is any good we try:
# dd if=/dev/md1 of=797566305.2 skip=2391542217 bs=512 count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.99103 s, 0.1 kB/s
We see an error message in the kernel log and there are 2 new errors in the SMART error log, so it seems everything was calculated correctly. The two files dumped by dd are also identical.
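The two dumps can be compared with something like this (cmp prints nothing and exits successfully when the files are identical):

cmp 797566305 797566305.2 && echo identical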
The file system uses blocks of 4096 bytes, so for the file system this is block 2391542217*512/4096 = 298942777.125, that is block 298942777, with the fraction telling us that our sector is the second 512-byte sector inside it.
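Or as shell arithmetic, with 8 sectors of 512 bytes per 4096-byte block:

md_sector=2391542217
echo $(( md_sector / 8 ))   # 298942777, the block to feed to debugfs
echo $(( md_sector % 8 ))   # 1, i.e. the second 512-byte sector inside that block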
Then we use debugfs:
# debugfs
debugfs 1.41.2 (02-Oct-2008)
debugfs: open /dev/md1
debugfs: icheck 298942777
Block Inode number
298942777 <block not found>
Which isn't making any sense to me, since I get that error about every 2 hours when a certain cronjob runs. 298942777 / 32768 (blocks per group) = 9123.00955. So I wonder if ext3 is keeping some metadata there that debugfs doesn't tell me about?
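For what it's worth, the block sits only 313 blocks into group 9123 (298942777 - 9123*32768 = 313), and the start of a block group is typically where ext3 keeps the block bitmap, inode bitmap and inode table, so metadata is at least plausible. One way to check would be to look at what dumpe2fs reports for that group; I haven't shown its output here and the amount of grep context is a guess:

echo $(( 298942777 - 9123 * 32768 ))      # 313, offset of our block inside group 9123
dumpe2fs /dev/md1 | grep -A 6 'Group 9123:'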
Anyway, at some point I was running smartctl -t long /dev/sdd again because I couldn't reproduce the error anymore. The cronjob started, and I then found a lot of errors in my log again, which now included:
raid5:md1: read error corrected (8 sectors at 797180744 on sdd3)
It's so much fun how disks always behave the same way when trying to read the same sector.
Smartctl now reports:
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
I assume the kernel removed the pending sector by writing to it. The other pending sector had already been gone for some time; I assume new data got written to that one.