A few days ago I received from a friend an external HDD that had gone wild yet again. The previous time was about half a year ago, so it now seems to have settled into a rhythm. He uses it as a backup device for most of his personal archives and he handles the drive with extra care: no physical damage, always safely unmounted, and so on. The disk is then kept in a box on a shelf (no nearby magnets :) ) and after some time... it just refuses to work.
Actually, the disk seems to work, but WindowsXP refuses to see its partitions and data and asks to 'format the disk before use'. Obviously not a thing to do to an archive ;)
The damage and quick fix
The drive in question is an ADATA NH92: an external USB 2.0 case with a 500 GB, 5400 rpm disk with 8 MB of cache inside. It is set up with one huge FAT32 partition. That is not very safe for long-term archives, so the first time I heard about the problem and got the drive I suspected the worst.
I examined the drive's contents as thoroughly as possible and the file system was actually not damaged at all. Both FAT copies were identical, all directories were in perfect condition, nothing was truncated, all files were readable, not cross-linked, and seemed OK.
The only thing that was damaged each time was the BootSector. Curiously, it was not overwritten with random trash but almost zeroed out. Almost, because there was some short, apparently structured data at the beginning of the block.
I suspected that some antivirus or cleanup utility had tried to help my friend recover the disk and failed, but the data block did not look like anything that could be a reasonable BootSector, so a virus, maybe? But immediately after the damaged BootSector a copy of it was left intact, so that would have to be a very friendly virus...
Restoring the BS from its backup immediately revived the disk: Windows stopped complaining about formatting and displayed the contents. At first I did it manually, but it has happened three or four times already, so lately I've been using TestDisk (http://www.cgsecurity.org/wiki/TestDisk). With it, restoring the BS from its backup is really quick: in the menu look into 'Advanced' and then 'Boot'. If you are going to use it, keep in mind that you should first check what is actually damaged. Your drive may have different problems.
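Since "first check what is damaged" is the important part, here is a minimal, read-only sketch of how one might classify a boot-sector candidate taken from a disk image before restoring anything. The names `classify_sector` and `read_sector` are made up for this illustration, and 512-byte sectors are an assumption. The checks are deliberately rough: a plausible FAT boot sector starts with a jump opcode (0xEB or 0xE9) and ends with the bytes 55 AA.

```python
SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def classify_sector(sector: bytes) -> str:
    """Give a rough, read-only verdict on one boot-sector candidate."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly one sector")
    if sector[:4] == b"USBC":
        return "overwritten with a USB CBW"
    if not any(sector):
        return "all zeroes"
    # A FAT boot sector starts with a jump opcode and ends with 55 AA.
    if sector[0] in (0xEB, 0xE9) and sector[510:512] == b"\x55\xaa":
        return "plausible boot sector"
    return "unknown content"


def read_sector(image_path: str, lba: int) -> bytes:
    """Read one sector from a disk image file (hypothetical path, read-only)."""
    with open(image_path, "rb") as image:
        image.seek(lba * SECTOR_SIZE)
        return image.read(SECTOR_SIZE)
```

Running `classify_sector` on both the BootSector and its backup, read from an image of the drive rather than from the drive itself, tells you whether a plain restore is likely to be enough.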
Problem and partial diagnosis
Unfortunately, I was all too happy about reviving the drive and did not preserve backup copies from the earlier incidents to compare what was actually written to the BS, but I remembered one thing: the whole sector was almost zeroed out and contained a few bytes of data with a "USBC" string.
The current fault looks the same again; here's the dump:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00007E00 55 53 42 43 40 C4 45 89 00 04 00 00 00 00 0A 2A USBC@ÄE........*
00007E10 00 00 00 00 3F 00 00 02 00 00 00 00 00 00 00 6A ....?..........j
00007E20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00007E30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
........ .... zeroes ...
00007FC0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00007FD0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00007FE0 00 00 00 00 72 72 41 61 A1 D2 E8 00 03 00 00 00 ....rrAa.Òè.....
As you see, that's nothing like random trash or the middle of a binary file. It looks like an eye-catching magic string, some header, and then a series of a few small integers. So I went hunting for the USBC magic string and it turned out to be ...
Wait. What? SCSI over USB packet?!
The docs for USB Mass Storage Bulk-Only Transport, page 13, confirmed it: 0x55534243 is the signature of a CBW (Command Block Wrapper) packet, which is the USB wrapper for a SCSI command. Dissecting the data visible above (header values are written in dump byte order; the CBW header fields themselves are little-endian) we get:
0x55534243 ('USBC') - dCBWSignature, the CBW identifier
0x40C44589 - dCBWTag, a passthrough echo
0x00040000 - dCBWDataTransferLength, bulk transfer bytes [little-endian, so 0x00000400 = 1024 B, i.e. two 512 B blocks]
0x00 - bmCBWFlags [dir = 0: from host to device]
0x00 - bCBWLUN [device = 0]
0x0A - bCBWCBLength, wrapped command length [10B]
and the following 10 bytes are the 'CBWCB', the original message to the device that was wrapped.
The disk is internally a simple 2.5" drive, so the payload is just a SCSI write command:
0x2A - operation code WRITE(10)
0x00 - flags: WRPROTECT=0, DPO=0, FUA=0, FUA_NV=0
0x0000003F - LBA address: 0x3F [!]
0x00 - group number=0
0x0002 - transfer length: 2 blocks
0x00 - control byte
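The dissection above can be reproduced mechanically. Below is a minimal sketch that decodes the first 31 bytes of the dump with Python's `struct` module; `parse_cbw` and `RAW` are names invented for this example, and the sketch only handles a wrapper that carries a 10-byte command, as this one does.

```python
import struct

# First 31 bytes of the damaged sector, copied from the dump above.
RAW = bytes.fromhex(
    "55534243 40C44589 00040000 00 00 0A "
    "2A 00 0000003F 00 0002 00 000000000000"
)


def parse_cbw(raw: bytes) -> dict:
    """Decode a CBW that wraps a 10-byte SCSI command such as WRITE(10)."""
    # The 15-byte CBW header is little-endian.
    signature, tag, data_length, flags, lun, cb_length = struct.unpack_from(
        "<4sIIBBB", raw, 0
    )
    if signature != b"USBC":
        raise ValueError("no CBW signature")
    cb_length &= 0x1F  # only the low 5 bits carry the command length
    if cb_length != 10:
        raise ValueError("this sketch only handles 10-byte commands")
    # The wrapped SCSI command block is big-endian.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack_from(
        ">BBIBHB", raw, 15
    )
    return {
        "tag": tag,
        "data_length": data_length,
        "to_device": not flags & 0x80,  # bit 7 clear = host to device
        "lun": lun & 0x0F,
        "opcode": opcode,
        "lba": lba,
        "blocks": blocks,
    }


if __name__ == "__main__":
    for name, value in parse_cbw(RAW).items():
        print(f"{name}: {value}")
```

The two `struct` format strings differ in byte order on purpose: the USB wrapper and the SCSI command inside it use opposite conventions, which is easy to trip over when reading a hex dump by eye.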
Now look at the LBA in that command.
The BootSector that was damaged was at offset 0x7E00 so at sector 0x3E.
The 0x3F sector is the BackupBootSector.
This write command was meant to write at sector 0x3F.
This write command contains 'transfer length' of two blocks.
The BootSector and its copy should always be exactly the same, so if the Backup was to be written, then the normal BootSector was probably meant to be refreshed too. That means that the BootSector was not overwritten accidentally. The BootSector was meant to be updated, and then, probably immediately, the BackupBootSector was meant to be updated too.
But how could the whole write command get written to the drive, even with its USB wrapper, instead of being executed? And in such a way that no other sectors were damaged?
I think that the answer lies somewhere in how bulk transfer works. Looking at the raw data stream, it would be something like this:
... | command#0 | bulk data for command#0 | command#1 | bulk data for command#1 | command#2 | ...
Command#0 would be "write-a-BootSector" and command#1 would be "write-a-BackupBootSector". Transfers are performed in larger blocks and commands are short, so I assume each command ends up padded to fill at least a whole block. That way, the drive's controller may just read block by block and either interpret a block as a command, or pass it further as a block of data to process. To know how many blocks to pass, each command holds information on how long its attached bulk data is.
Now, consider what would happen if for some reason the bulk data for the first command is missing. The controller inside the external drive would get:
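To make the "each command holds information on how long its attached bulk data is" part concrete, here is a minimal sketch of how a host could assemble such a wrapped write command. `build_write10_cbw` and `BLOCK_SIZE` are names made up for this illustration, and 512-byte blocks are an assumption; the field layout follows the dissection earlier in this post. The wrapper itself is 31 bytes; how an adapter buffers it internally is not modelled here.

```python
import struct

BLOCK_SIZE = 512  # assumption: 512-byte logical blocks


def build_write10_cbw(tag: int, lba: int, blocks: int, lun: int = 0) -> bytes:
    """Wrap a SCSI WRITE(10) command in a 31-byte USB Command Block Wrapper."""
    # SCSI command block: opcode, flags, LBA, group, transfer length, control.
    cdb = struct.pack(">BBIBHB", 0x2A, 0, lba, 0, blocks, 0)
    header = struct.pack(
        "<4sIIBBB",
        b"USBC",              # dCBWSignature
        tag,                  # dCBWTag, echoed back by the device
        blocks * BLOCK_SIZE,  # dCBWDataTransferLength: size of the data stage
        0x00,                 # bmCBWFlags: bit 7 clear = host to device
        lun,                  # bCBWLUN
        len(cdb),             # bCBWCBLength
    )
    # The CBWCB field is always 16 bytes, so shorter commands are zero-padded.
    return header + cdb.ljust(16, b"\x00")


if __name__ == "__main__":
    print(build_write10_cbw(tag=1, lba=0x3F, blocks=2).hex(" "))
```

The receiver has nothing but the length field to tell it where the data stage ends and the next command begins, which is why a lost data stage can shift everything that follows.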
... | command#0 | command#1 | bulk data for command#1 | command#2 | ...
It would fetch command#0, read it as "write-a-block-at-BootSector" and then treat the next blocks as the anticipated bulk data to be written ... so it would consume the immediately following command as data and leave bulkdata#1 unread. Then bulkdata#1 would be consumed as the next command. In the optimistic case, that would fail, everything would get out of sync, and the communication would probably be discarded and reinitialized. In the pessimistic case, the bulk data might look like a command and even further damage could occur.
Currently, I do not know if such out-of-sync errors are reported anywhere and how to check for them. I also have no idea how a block of bulk data could evaporate. But, for me, it just looks like it did. It certainly did not evaporate at random, since this problem occurred many times with this external drive. Moreover, if it happened at random, it would likely happen all over the place, not just at the first BootSector!
Ok, so if not at random, let's consider the second extreme: a systematic error.
Writes to the BootSector and to the BackupBootSector are most probably performed in exactly the same way, only with LBA address 3E or 3F. Now consider that both operations have their bulk data systematically discarded:
... | command#0 | command#1 | command#2 | ...
That would simply write "command#1" as the new BootSector, leave the BackupBootSector as-is, and happily proceed with the next command. That way absolutely no errors would be observed. Since the operating system is happy that everything went well, the disk would operate normally until it is unplugged, and the next day it would be dead. Or, if the system flushed and reloaded the disk's configuration, it would immediately drop dead.
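Both scenarios can be played through with a toy model. The sketch below is only an illustration of my guess, not of how any real adapter firmware works: `run_toy_adapter` is an invented function in which every command blindly stores the next item of the stream as its data, with no sanity checks at all.

```python
def run_toy_adapter(stream):
    """Toy model: every command blindly takes the next stream item as its data."""
    disk = {}    # target sector name -> what ended up stored there
    errors = []  # items that arrived where a command was expected
    items = iter(stream)
    for item in items:
        if item[0] != "cmd":
            errors.append(item)  # payload showed up in a command slot
            continue
        _, target, _name = item
        payload = next(items, None)
        if payload is None:
            continue  # stream ended before any data arrived
        if payload[0] == "data":
            disk[target] = payload[1]
        else:
            # A command swallowed as data: its wrapper lands on the disk.
            disk[target] = "CBW of " + payload[2]
    return disk, errors


if __name__ == "__main__":
    cmd0 = ("cmd", "BootSector", "command#0")
    cmd1 = ("cmd", "BackupBootSector", "command#1")
    cmd2 = ("cmd", "elsewhere", "command#2")
    data0 = ("data", "boot sector")
    data1 = ("data", "boot sector copy")

    print("healthy:   ", run_toy_adapter([cmd0, data0, cmd1, data1]))
    print("one lost:  ", run_toy_adapter([cmd0, cmd1, data1]))
    print("all lost:  ", run_toy_adapter([cmd0, cmd1, cmd2]))
```

In this model the "one lost" case leaves a stray data block in a command slot, which is at least detectable, while the "all lost" case ends with a wrapper in the BootSector and nothing flagged at all.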
Disclaimer
I've found the USBC marker, I've analyzed the data, it matches the CBW/SCSI command. All the rest is just my guessing.
To me, it seems like some bulk data was discarded, but I do not know whether it was the WindowsXP's faulty device driver, the USB chips on the mainboard, the USB->drive adapter that sits inside the ADATA NH92 aluminium case, or maybe the drive's internal electronics. Well, actually we can cross out the drive's electronics, since under normal operation the adapter should translate the USB/CBW/SCSI message into just a SCSI message to be sent to the drive. This leaves the driver or the adapter.
Therefore, the next thing I'll suggest to my friend is to buy a new USB adapter with a new case, not the NH92 and preferably not from ADATA, put the disk into it and observe how it behaves. If a similar fault happens again, that would mean that either the mainboard's USB controller or the WindowsXP's USB Mass Storage device driver is faulty.
However, I doubt it. I am quite sure that the problem is in the USB adapter that comes with the ADATA NH92. My friend uses many other USB storage devices, like pendrives or cameras that expose their internal memory as a USB storage device, and the problem occurs only with that single ADATA device.
Scale of the problem
ADATA NH92 seems to have problems. Searching the internet I quickly found many complaints about losing data; for example see the review's comments here. This one is in Polish only, but lots of similar ones can be found. Many claim that the disk worked properly and at some point it just "died" and asked to be formatted. Some even attempt a diagnosis and find that e.g. BootSectors were overwritten with USBC marks :P
But, there are more serious cases. For example here at SuperUser, a user of this drive describes that he found out many more sectors were overwritten with USBC marks.
Recalling that at first I thought some virus might have damaged the drive, I searched for "USBC" in a somewhat broader sense. I found many, really many posts on various forums complaining about some "USBC virus" that would disable drives or damage data. Most of them could be summarized as follows:
- drive wants to be formatted, critical sectors were overwritten with 'USBC'
- contents of a directory have evaporated, a single large USBCxxxx file showed up (where xxxx are some strange characters)
- a file got damaged, its contents were overwritten with "USBC" at some point
Michal's post on SuperUser and these search results worry me greatly.
First, in his ADATA NH92 the damage pattern was similar, but occurred at random locations. I'm currently running a low-level search for USBC marks over my friend's whole drive, but nothing has been found yet. Maybe this drive was lucky, or maybe Michal's was not.
Second, the internet fora indicate that this problem obviously occurs not only in the ADATA NH92. People have reported the same problem with other external drives and card readers. This might indicate some serious problem in a whole batch of USB hardware controllers used in cheap adapters (quite probable!) or even those more expensive ones used in mainboards (unlikely) or a hideous bug in the device driver.
I'd say that given these observations the 'device driver' option is quite reasonable, but I think I've seen posts about "USBC" problems written by people using Linux, so it's hard to put all the blame on WindowsXP's drivers ;)
FYI: the ADATA NH92 adapter chipset is Moai MA6116F6 422A-1035 MAN09088.1 backed by a 24C02 EEPROM.
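For anyone who wants to run the same kind of low-level search for USBC marks, here is a minimal sketch that walks a disk image sector by sector. `find_cbw_sectors` is a name invented for this example, 512-byte sectors are an assumption, and the check only looks at sector starts, so a wrapper that landed in the middle of a sector would be missed. Besides the signature it also requires a sane command length (1 to 16) at byte 14, to skip files that merely begin with the text "USBC".

```python
import io
from typing import BinaryIO, Iterator

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def find_cbw_sectors(image: BinaryIO, sector_size: int = SECTOR_SIZE) -> Iterator[int]:
    """Yield the number of every sector that begins like a USB CBW."""
    lba = 0
    while True:
        sector = image.read(sector_size)
        if len(sector) < 15:
            break  # end of image (or a useless tail fragment)
        # Signature plus a plausible bCBWCBLength (1..16) at offset 14.
        if sector[:4] == b"USBC" and 1 <= sector[14] <= 16:
            yield lba
        lba += 1


if __name__ == "__main__":
    # Tiny in-memory demo: three sectors, the middle one starts with a CBW.
    cbw_like = b"USBC" + bytes(10) + b"\x0a" + bytes(SECTOR_SIZE - 15)
    demo = io.BytesIO(bytes(SECTOR_SIZE) + cbw_like + bytes(SECTOR_SIZE))
    print(list(find_cbw_sectors(demo)))
```

On a real drive one would pass a file object opened read-only on an image of the disk; reading one sector per call is slow on a 500 GB image, so a real scan would want larger reads.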
tl;dr
I'm no expert. If you have similar problems, I suggest you first consult someone who is. If you cannot, or cannot afford to, then you might do the same as I told my friend: buy another USB adapter with a plastic or metal case. Just an adapter and a case, you don't need a new disk. Then move the disk from the old case (e.g. the NH92) into the new one, and put your old adapter aside. If the problem does not come back for some reasonable time, trash the old adapter and be happy. If the problem comes back, then either it is the mainboard or your operating system (fix that and you can put the old adapter to use with some other disk), or the new adapter has the same fault. Maybe try yet another one? Still cheaper to test than to buy or repair a mainboard...
Final note
I am quite convinced that this is the adapter's fault, but I may be wrong. If you know more about how/why the command got written to the disk instead of data, drop me a note or link!