2013-06-25

ADATA NH92 recurring malfunction

A few days ago a friend handed me an external HDD that had gone wild yet again. The previous incident was about half a year ago, so the failures now seem to come at a regular interval. He uses it as a backup device for most of his personal archives and he handles the drive with extra care: no physical damage, always safely unmounting, and so on. The disk is then kept in a box on the shelf (no nearby magnets :) ) and after some time... it just refuses to work.

Actually, the disk seems to work, but Windows XP refuses to see its partitions and data and asks to 'format the disk before use'. Obviously not a thing to do to an archive ;)

The damage and quick fix

The drive in question is an ADATA NH92: an external USB 2.0 case with a 500 GB, 5400 rpm, 8 MB cache disk inside. It is set up with one huge FAT32 partition. That is not very safe for long-term archives, so the first time I heard about the problem and got the drive I suspected the worst.

I examined the drive's contents as thoroughly as possible and the file system was actually not damaged at all. Both FAT copies were identical, all directories were in perfect condition, nothing was truncated, and all files were readable, not cross-linked, and seemed OK.

The only thing that was damaged each time was the BootSector. Curiously, it was not overwritten with random trash; it was almost zeroed out. Almost, because there was some short, apparently structured data at the beginning of the block.

I suspected that some antivirus or cleanup utility had tried to help my friend recover the disk and failed, but the data block did not look like anything that could be a reasonable BootSector, so a virus, maybe? But immediately after the damaged BootSector a copy of it was left intact, so that would have to be a very friendly virus...

Restoring the BS from its backup immediately revived the disk: Windows stopped complaining about formatting and displayed the contents. At first I did it manually, but it has happened three or four times already, so later I used TestDisk (http://www.cgsecurity.org/wiki/TestDisk). With it, restoring the BS from its backup is really quick: in the menu, look into 'Advanced' and then 'Boot'. If you are going to use it, keep in mind that you should first check what is damaged. Your drive may have different problems.
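A rough way to "check what is damaged" before restoring anything is to look at the sector read-only and see whether it still resembles a FAT32 boot sector. The sketch below is only a plausibility test, not what TestDisk does internally; the helper names and the image path are made up for illustration, and 512-byte sectors are assumed.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def looks_like_fat32_boot_sector(sector: bytes) -> bool:
    """Cheap plausibility check; it does not validate the whole BPB."""
    if len(sector) < SECTOR_SIZE:
        return False
    has_boot_signature = sector[510:512] == b"\x55\xaa"  # trailing 55 AA
    has_jump = sector[0] in (0xEB, 0xE9)                 # x86 jump at offset 0
    has_fat32_label = sector[82:90] == b"FAT32   "       # BS_FilSysType field
    return has_boot_signature and has_jump and has_fat32_label


def read_sector(image_path: str, lba: int) -> bytes:
    """Read one sector from an image file (hypothetical path), read-only."""
    with open(image_path, "rb") as image:
        image.seek(lba * SECTOR_SIZE)
        return image.read(SECTOR_SIZE)
```

Running the check on both the primary sector and its backup tells you which of the two is still usable; if neither passes, restoring one over the other will not help.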

Problem and partial diagnosis

Unfortunately, I was all too happy about reviving the drive and did not preserve copies of the damaged BS from the earlier incidents to compare what was actually written there, but I remembered one thing: the whole sector was almost zeroed out and contained a few bytes of data with a "USBC" string.

The current fault looks the same again; here's the dump:

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

00007E00  55 53 42 43 40 C4 45 89 00 04 00 00 00 00 0A 2A  USBC@ÄE........*
00007E10  00 00 00 00 3F 00 00 02 00 00 00 00 00 00 00 6A  ....?..........j
00007E20  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00007E30  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
........               .... zeroes ...
00007FC0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00007FD0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00007FE0  00 00 00 00 72 72 41 61 A1 D2 E8 00 03 00 00 00  ....rrAa.Òè.....

As you see, that's nothing like random trash or the middle of a binary file. It looks like an eye-catching magic string, some header, and then a series of a few small integers. So I hunted for the USBC magic string and it turned out to be ...

Wait. What? SCSI over USB packet?!

The USB Mass Storage Bulk-Only Transport specification, page 13, confirmed it: the bytes 55 53 42 43 are the signature of a CBW (Command Block Wrapper) packet, which is the USB wrapper for a SCSI command. The multi-byte CBW fields are little-endian. Dissecting the data visible above we get:

55 53 42 43 ('USBC') - dCBWSignature, the CBW identifier [0x43425355]
40 C4 45 89          - dCBWTag, a passthrough echo [0x8945C440]
00 04 00 00          - dCBWDataTransferLength, bulk transfer bytes [0x00000400 = 1 kB]
00                   - bmCBWFlags [dir = 0: from host to device]
00                   - bCBWLUN [device = 0]
0A                   - bCBWCBLength, wrapped command length [10 B]

and the following 10 bytes are the 'CBWCB', the original message to the device that was wrapped.
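The same dissection can be double-checked mechanically. This is a minimal sketch that unpacks the first 31 bytes of the dump as a CBW; the function name and the dictionary keys are my own choice, only the field layout (little-endian, 31 bytes in total) comes from the specification.

```python
import struct

# First 31 bytes of the damaged sector, copied from the dump above.
RAW_CBW = bytes.fromhex(
    "55 53 42 43 40 C4 45 89 00 04 00 00 00 00 0A "
    "2A 00 00 00 00 3F 00 00 02 00 00 00 00 00 00 00"
)


def parse_cbw(raw: bytes) -> dict:
    """Unpack a Command Block Wrapper; multi-byte fields are little-endian."""
    signature, tag, data_len, flags, lun, cb_len, cb = struct.unpack(
        "<4sIIBBB16s", raw[:31]
    )
    if signature != b"USBC":
        raise ValueError("not a CBW: signature mismatch")
    cb_len &= 0x1F  # only the low 5 bits carry the command length
    return {
        "tag": tag,
        "data_transfer_length": data_len,
        "direction": "device-to-host" if flags & 0x80 else "host-to-device",
        "lun": lun & 0x0F,
        "command_block": cb[:cb_len],
    }


cbw = parse_cbw(RAW_CBW)
print(hex(cbw["tag"]), cbw["data_transfer_length"], cbw["direction"])
print(cbw["command_block"].hex(" "))
```

Read as little-endian, the transfer length comes out as 1024 bytes and the wrapped command is the 10-byte block starting with 2A.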

The disk is internally a simple 2.5" drive, so the payload is just a SCSI write command:

0x2A         - operation code: WRITE(10)
0x00         - flags: WRPROTECT=0, DPO=0, FUA=0, FUA_NV=0
0x0000003F   - LBA address: 0x3F [!]
0x00         - group number=0
0x0002       - transfer length: 2 blocks
0x00         - control byte
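For completeness, here is a small sketch that decodes those 10 bytes as a WRITE(10) command block, whose fields are big-endian. The function name and the returned keys are illustrative, and the 512-byte sector size used to turn the LBA into a byte offset is an assumption about this drive.

```python
import struct

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

# The 10-byte command block taken from the dump.
CDB = bytes.fromhex("2A 00 00 00 00 3F 00 00 02 00")


def parse_write10(cdb: bytes) -> dict:
    """Decode a SCSI WRITE(10) command block (big-endian fields)."""
    opcode, flags, lba, group, blocks, control = struct.unpack(">BBIBHB", cdb[:10])
    if opcode != 0x2A:
        raise ValueError("not a WRITE(10) command")
    return {
        "wrprotect": flags >> 5,
        "dpo": bool(flags & 0x10),
        "fua": bool(flags & 0x08),
        "lba": lba,
        "blocks": blocks,
        "byte_offset": lba * SECTOR_SIZE,
        "byte_length": blocks * SECTOR_SIZE,
    }


cmd = parse_write10(CDB)
print(hex(cmd["lba"]), cmd["blocks"], hex(cmd["byte_offset"]), cmd["byte_length"])
```

With 512-byte sectors, LBA 0x3F starts at byte offset 0x7E00, and two blocks amount to 1024 bytes.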

Now look at the LBA in that command.
The BootSector that was damaged was at offset 0x7E00, so, with 512-byte sectors, at sector 0x7E00 / 0x200 = 0x3F.
This write command was meant to write at sector 0x3F, the very sector it ended up in.
This write command contains a 'transfer length' of two blocks, so it covers sectors 0x3F and 0x40.

That means that the BootSector was not overwritten accidentally: a write to exactly this place was requested. The BootSector and its backup copy should always be exactly the same, so if the BootSector was meant to be updated, then, probably immediately afterwards, the BackupBootSector was meant to be updated too.

But how could the whole write command get written to the drive, even with its USB wrapper, instead of being executed? And in such a way that no other sectors were damaged?

I think that the answer lies somewhere in how BulkTransfer works. Looking at the raw data stream, it would be something like this:

... | command#0 | bulk data for command#0 | command#1 | bulk data for command#1 | command#2 |  ...

Command#0 would be "write-a-BootSector" and command#1 would be "write-a-BackupBootSector". Transfers are performed in larger blocks and commands are short, so presumably each command is padded to fill at least a whole block. That way, the drive's controller may just read block by block and either interpret a block as a command, or pass it on as a block of data to process. To know how many blocks to pass, each command holds information on how long its attached bulk data is.

Now, consider what would happen if for some reason the bulk data for the first command is missing. The controller inside the external drive would get:

... | command#0 | command#1 | bulk data for command#1 | command#2 |  ...

It would fetch command#0, read it as "write-a-block-at-BootSector" and then treat the next blocks as the anticipated bulk data to be written ... so it would consume the immediately following command as data and leave bulkdata#1 unread. Then bulkdata#1 would be consumed as the next command. In the optimistic case, that would fail, everything would get out of sync and the communication would probably be discarded and reinitialized. In the pessimistic case, the bulk data may look like a command and even further damage could occur.

Currently, I do not know whether such out-of-sync errors are reported anywhere or how to check for them. I also have no idea how a block of bulk data could evaporate. But, to me, it just looks like it did. It certainly did not evaporate at random, since this problem occurred many times with this external drive. Moreover, if it happened at random, it would likely happen all over the place, not just at the first BootSector!

Ok, so if not at random, let's consider the second extreme: a systematic error.

Writes to the BootSector and to the BackupBootSector are most probably performed in exactly the same way; only the LBA address differs. Now consider that both operations have their bulk data systematically discarded:

... | command#0 | command#1 | command#2 |  ...

That would simply write "command#1" as the new BootSector, leave the BackupBootSector as-is, and happily proceed with the next command. That way absolutely no errors would be observed. Since the operating system is happy that everything went well, the disk would operate normally until it is unplugged, and the next day it would be dead. Or, if the system flushed and reloaded the disk's configuration, it would immediately drop dead.
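Both guesses (one missing data phase, and both boot-sector data phases missing) can be played through with a toy model. To be clear about the assumptions: this is not real USB or SCSI code, the "controller" is deliberately naive, the stream follows the simplified picture drawn above, and the addresses and names are invented for the illustration.

```python
def cmd(lba: int, count: int = 1) -> bytes:
    """Toy command block: the 'USBC' magic plus target LBA and block count."""
    return b"USBC" + bytes([lba, count])


def run_controller(stream, disk):
    """Consume the stream naively: a command, then `count` blocks of data."""
    events = []
    units = iter(stream)
    for unit in units:
        if not unit.startswith(b"USBC"):
            events.append("desync: expected a command, got data")
            break
        lba, count = unit[4], unit[5]
        for i in range(count):
            payload = next(units, None)
            if payload is None:
                events.append("stream ended inside a data phase")
                break
            disk[lba + i] = payload  # stored blindly, even if it is a command
    return events


BOOT, BACKUP, FILE = 0, 1, 5  # toy addresses, not the real disk layout
OLD, NEW = b"old boot sector", b"new boot sector"


def simulate(stream):
    disk = {BOOT: OLD, BACKUP: OLD}
    events = run_controller(stream, disk)
    return disk, events


healthy = [cmd(BOOT), NEW, cmd(BACKUP), NEW, cmd(FILE), b"file data"]
one_missing = [cmd(BOOT), cmd(BACKUP), NEW, cmd(FILE), b"file data"]
both_missing = [cmd(BOOT), cmd(BACKUP), cmd(FILE), b"file data"]

for name, stream in [("healthy", healthy), ("one missing", one_missing),
                     ("both missing", both_missing)]:
    disk, events = simulate(stream)
    print(name, disk[BOOT], disk[BACKUP], events)
```

In this model the single missing data phase ends in a detectable desync, while the systematic case leaves a command in the boot sector, keeps the backup untouched and reports nothing, which is the pattern described above.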

Disclaimer

I've found the USBC marker, I've analyzed the data, it matches the CBW/SCSI command. All the rest is just my guessing.

To me, it seems like some bulk data were discarded, but I do not know whether it was Windows XP's faulty device driver, the USB chips on the mainboard, the USB->drive adapter that sits inside the ADATA NH92 aluminium case, or maybe the drive's internal electronics. Well, actually we can cross out the drive's electronics, since under normal operation the adapter should translate the USB/CBW/SCSI message into just a SCSI message to be sent to the drive. This leaves the driver or the adapter.

Therefore, the next thing I'll suggest to my friend is to buy a new USB adapter with a new case, not an NH92 and preferably not from ADATA, put the disk into it and observe how it behaves. If a similar fault shows up again, that would mean that either the mainboard's USB controller or Windows XP's USB Mass Storage device driver is faulty.

However, I doubt it. I am quite sure that the problem is in the USB adapter that comes with the ADATA NH92. My friend uses many other USB storage devices, like pendrives or cameras that expose their internal memory as a USB storage device, and the problem occurs only with that single ADATA device.

Scale of the problem

The ADATA NH92 seems to have problems. Searching the internet I quickly found many complaints about losing data; for example, see the comments under the review here. That one is in Polish only, but lots of similar ones can be found. Many claim that the disk worked properly and at some point it just "died" and asked to be formatted. Some even attempt a diagnosis, and they find that e.g. BootSectors were overwritten with USBC marks :P

But there are more serious cases. For example here at SuperUser, a user of this drive describes how he found that many more sectors were overwritten with USBC marks.

Recalling that at first I had suspected a virus, I searched for "USBC" in a broader sense. I found many, really many posts on various forums complaining about some "USBC virus" that would disable drives or damage data. Most of them could be summarized as follows:

  • drive wants to be formatted, critical sectors were overwritten with 'USBC'
  • contents of a directory have evaporated, a single large USBCxxxx file showed up (where xxxx is some strange characters)
  • a file got damaged, its contents were overwritten with "USBC" at some point

Michal's post at SuperUser and these search results worry me greatly.

First, in his ADATA NH92 the damage pattern was similar, but it occurred at random locations. I'm currently running a low-level search over my friend's whole drive for USBC marks, but nothing has been found yet. Maybe this drive was lucky, or maybe Michal's from SuperUser was not.

Second, the internet fora indicate that this problem is clearly not limited to the ADATA NH92. People have reported the same problem with other external drives and card readers. This might indicate a serious problem in a whole batch of USB hardware controllers used in cheap adapters (quite probable!), or even in the more expensive ones used on mainboards (unlikely), or a hideous bug in the device driver.

I'd say that given these observations the 'device driver' option is quite reasonable, but I think I've seen posts about "USBC" problems written by people using Linux, so it's hard to put all the blame on Windows XP's drivers ;)
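Such a low-level search does not need special tools. Below is a minimal sketch of one way to do it on a raw image of the drive; it is not the tool I actually used. The function name and the image path are hypothetical, 512-byte sectors are assumed, and only sectors that start with the magic string are reported, as in the dump above.

```python
from typing import Iterator

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def find_usbc_sectors(image_path: str, sector_size: int = SECTOR_SIZE,
                      chunk_sectors: int = 2048) -> Iterator[int]:
    """Yield the numbers of all sectors that begin with the 'USBC' magic."""
    with open(image_path, "rb") as image:  # read-only on purpose
        base = 0
        while True:
            chunk = image.read(sector_size * chunk_sectors)
            if not chunk:
                break
            for offset in range(0, len(chunk), sector_size):
                if chunk[offset:offset + 4] == b"USBC":
                    yield base + offset // sector_size
            base += len(chunk) // sector_size


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # e.g. python scan.py drive.img (hypothetical name)
        for lba in find_usbc_sectors(sys.argv[1]):
            print(f"sector {lba:#x} at offset {lba * SECTOR_SIZE:#x}")
```

Scanning an image taken beforehand, rather than the live device, keeps the search from touching the damaged disk at all.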

FYI: ADATA NH92 adapter chipset is Moai MA6116F6 422A-1035 MAN09088.1 backed by 24C02 eeprom.

tl;dr

I'm no expert. If you have similar problems, I suggest you first consult someone who is. If you cannot, or cannot afford it, then you might do the same as I told my friend: buy another USB adapter with a plastic or metal case. Just an adapter and a case, you don't need a new disk. Then move the disk from the old case (e.g. the NH92) into the new one, and put your old adapter aside. If the problem does not come back for some reasonable time, trash the old adapter and be happy. If the problem does come back, then either it is the mainboard or your operating system (fix that and you can put the old adapter to use with some other disk), or the new adapter has the same fault. Maybe try yet another one? Still cheaper to test than to buy or repair a mainboard...

Final note

I am quite convinced that this is the adapter's fault, but I may be wrong. If you know more about how/why the command got written to the disk instead of data, drop me a note or link!

8 comments:

Unknown said...

I have got the same problem. TestDisk reports that the boot sector is not OK, but the backup of the boot sector is OK. I have restored the boot sector from the backup, and booted. Now Windows can see the memory card that was affected, but I get only one file called USBC.
A file recovery program told me I have different FAT copies. But I wasn't sure whether I want to use the "experimental" FAT fix option in TestDisk, because there is no option to back up the FATs first. What should I do? Or how can I at least back up the FATs, so that if the fix fails I can restore the previous state and try something else?

quetzalcoatl said...

Ryan: sorry for the delay. I was in the middle of a holiday trip when you posted, and I didn't notice it earlier.

The easiest way to back up the contents is to create an image of the whole device. Note that I mean "device", not just "volume" (partition, etc). If you have access to any Linux, there's the `dd` command, which can literally copy everything byte-by-byte from a raw device like /dev/sda to a file, or later copy it from the file back to the device. It can take some time and it creates "backups" (or rather, dumps) with no compression, so be sure to have enough free space.

Regarding the USBC file: it means that the restore failed. Either a wrong backup was chosen, or the backup was damaged too, or maybe everything went well and it's "just" the real partition that is damaged. Any of these could have happened if you tried repairing the drive with other tools before running TestDisk. It would have been best to take a whole-device dump/backup even before running TestDisk, just in case. As it is now, the best thing to do (besides taking a backup, late but still) is to check with a hex editor or any other raw drive analyzer what the actual contents of the device are.

The "USBC" file is not a file. It is the header of a section. In my experience, seeing USBC as a file usually means that this section was misplaced during repair, and it's usually +1 or -1 sector off from the place where it should be. The section sometimes gets duplicated, too. However, in all such cases the actual correct contents of the drive were still there, and after I reconstructed the contents of that sector, everything started working well. I think I have described all I knew and did... sorry, it's hard to remember now.

Alex said...

"The BootSector that was damaged was at offset 0x7E00 so at sector 0x3E".

Here is error. Sector at offset 0x7E00 has LBA number 0x3F.

Gintoki said...

I have similar problems with my NH92. I tried to repair the NTFS tables, which worked once, then Win7 or any other Windows OS asked me to format the disk, which was out of the question (I had very important data on the NH92).

I used Ubuntu and opened it without any problems. It works on Ubuntu all the time, and doesn't work on any Windows OS (now and then I get it to work on Windows 7, but it seems to be random luck, usually when the rainbows are outside my window and unicorns run around).

F said...
This comment has been removed by the author.
F said...

I think this bug happens in Windows XP SP3 and Windows 7, but not in Windows XP SP1, Win2k or any Linux. But this should be checked.

Peter said...

Interesting analysis, very helpful.
I ran into a similar issue and ended up implementing it in my software.
Would love to see variations on the issue, so share images if you can and prod me.

mEnTal Roy said...

Just happened to me: the boot sector on my SD card got destroyed just by plugging it in.
I could recover my data using recovery software, BUT this is in no way an acceptable solution.
Somehow, Somewhere, Someone F-ed up big time with this low level stuff :)

I used a Hama SD card adapter and an exFAT-formatted 256GB SanDisk Pro SD card.
On the SD card was a (I think) bootable partition with the Magic Lantern firmware used to enhance Canon cameras with RAW video.

I plugged this into my HP Z820 workstation with Windows 10 x64 (version 1909), which at the same time had an extremely large 20 TB LaCie RAID hard disk attached over USB 3. This hard disk definitely uses the SCSI over USB protocol. I had noticed odd behaviour before when this disk was attached (another disk sometimes would not show up at all unless I disconnected the SCSI over USB disk... then the other USB disk, which previously was not detected by Windows at all, started to show up again).

There must be a root cause somewhere in the realm of SCSI over USB in combination with bootable exFAT partitions. So far I wasn't able to restore my bootable SD card with TestDisk and the like. I could recover my footage, and to the recovery software the partition appears to be completely fine... unfortunately I can only recover data with the software... there's no option to restore the bootable partition.