
I have recently been investigating the causes of binary file corruption. Specifically, we have an Android app whose native part reads and writes a binary file on the SD card. Sometimes the binary file becomes corrupted for unknown reasons. We have collected some of these files from different users and found some interesting facts.

One major kind of corruption is that the first 4096 bytes of the binary file are erased. When I hexdump these files, the first 4096 bytes are all zeros: not more than 4096 and not less, but exactly 4096 bytes. I don't think this is a coincidence. I know that 4096 bytes is the size of one page, but lacking experience, I cannot figure out the cause, and more importantly, I don't know how to prevent this for other users and devices.

Besides that, some binary files also contain runs of consecutive zeros in the middle, which should not be there. If this is not a bug in our program, are there any possible causes related to the platform or the device kernel, or something else such as the device suddenly losing power?
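For triaging collected files, a small checker along the following lines can report how many aligned 4096-byte blocks are entirely zero and where the first one is. This is only a sketch: the function name `count_zero_blocks` and the fixed `BLOCK_SIZE` of 4096 are assumptions made for illustration, and a trailing partial block is ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096 /* assumed block size, matching the 4096-byte pattern described above */

/* Counts the full, BLOCK_SIZE-aligned blocks of `path` that contain only
 * zero bytes. Returns the count, or -1 if the file cannot be opened.
 * If `first_zero_block` is non-NULL it receives the index of the first
 * all-zero block, or -1 when there is none. */
long count_zero_blocks(const char *path, long *first_zero_block)
{
    static const unsigned char zeros[BLOCK_SIZE] = {0};
    unsigned char buf[BLOCK_SIZE];
    long index = 0, count = 0, first = -1;

    assert(path != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* Read one block at a time; stop at the first short read (EOF or error). */
    while (fread(buf, 1, BLOCK_SIZE, fp) == BLOCK_SIZE) {
        if (memcmp(buf, zeros, BLOCK_SIZE) == 0) {
            if (first < 0)
                first = index;
            count++;
        }
        index++;
    }
    fclose(fp);

    if (first_zero_block != NULL)
        *first_zero_block = first;
    return count;
}
```

A first zero block at index 0 corresponds to the "first 4096 bytes erased" pattern, while larger indices correspond to zero runs in the middle of the file. Legitimate data can of course also contain zero blocks, so the result is a hint rather than proof of corruption.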

I hope anyone who has experienced a similar situation can give me some hints, advice, or solutions. This has really confused me.

Many thanks~

songlj
  • Make sure all file operations, such as file open and file close, happen properly within the lifecycle of the application. – Jeegar Patel Jan 26 '16 at 06:21
  • For binary file copy on Android: http://stackoverflow.com/a/11212942/2183287 – fatihk Jan 26 '16 at 06:52
  • Be careful with the locations where you write your data; maybe you're trying to write to some illegal memory locations and the kernel is automatically taking care of that. –  Jan 26 '16 at 13:56
  • Have you tried different SD cards? There are many hacked SD cards. I had this happen to me and lost many files. – Bing Bang Jan 28 '16 at 05:16
  • @BingBang No, I cannot do that because these files are collected from different users, not on my own device. – songlj Jan 28 '16 at 07:19
  • It'd help a lot to see the code that modifies the file. Are you writing to the file with more than one thread? I've seen page-size based corruption before. – Andrew Henle Feb 15 '16 at 03:16
  • I would suggest re-reviewing all the code that could read or write that binary file instead of suspecting the low-level libraries. – Shane Lu Feb 15 '16 at 04:00
  • Is there a chance that you mmap the file, possibly with a page-sized struct used to access the data a page at a time, except that one of the conditionals used something like `if (s.member = NULL)`? Note the `=` instead of `==`. – technosaurus Feb 15 '16 at 06:16
  • @AndrewHenle do you mean multi-thread issue can cause page-size based corruption? could you please give me a bit more details about what you experienced before? thanks – songlj Feb 16 '16 at 08:48
  • I have had to use SD cards in several high-uptime hardware projects, and the one thing I can tell you is that they are to be regarded as **volatile storage**. It is simply not possible to rely on them. Sandisk Industrial SD cards are the ones I distrust the least, and they are seriously expensive. Your average el cheapo SD card is likely to fail after a few gigabytes of read/write cycles. I recommend you take your code and test it on some other media (a hard drive or SSD through a USB adaptor and powered USB hub would be best) and see if the corruption still occurs. –  Feb 16 '16 at 10:03
  • *do you mean multi-thread issue can cause page-size based corruption? could you please give me a bit more details about what you experienced before?* Different hardware, but it was a timing issue with concurrent writes to the same page (or disk sector; they were the same size in our case). One thread would write the first half of the bytes, the second thread the last half. Last one won: the other thread's data would be zeroed out. Either the system would have two pages modified concurrently, or the disk controller couldn't handle concurrent access to the same sector. – Andrew Henle Feb 16 '16 at 13:46
  • We never did figure out why. It was a case of "Doctor, it hurts when I do this." So we stopped doing that. Our guess was that it was the disk controller. And it only happened with multithreaded access. If you're doing multithreaded access, put a `mutex` on all the writes and see if it goes away. Or as @Wossname has suggested, try writing to different hardware. – Andrew Henle Feb 16 '16 at 13:48
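The "put a `mutex` on all the writes" suggestion from the comments could look roughly like the sketch below, in which two threads each write one half of the same 4096-byte page through a single lock. All names here (`locked_pwrite`, `write_page_from_two_threads`, `file_write_lock`) are hypothetical and not taken from the asker's code. The lock only serializes writes inside one process; it cannot fix faulty hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE 4096
#define HALF (PAGE / 2)

/* One process-wide lock shared by every writer of the file (hypothetical name). */
static pthread_mutex_t file_write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writes `len` bytes at `offset` while holding the lock, retrying on short
 * writes and EINTR. Returns 0 on success, -1 on error. */
int locked_pwrite(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;
    int rc = 0;

    assert(buf != NULL || len == 0);
    pthread_mutex_lock(&file_write_lock);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    pthread_mutex_unlock(&file_write_lock);
    return rc;
}

struct half_page_job { int fd; off_t offset; unsigned char fill; int result; };

/* Thread body: fills one half page with a marker byte and writes it. */
static void *write_half_page(void *arg)
{
    struct half_page_job *job = arg;
    unsigned char data[HALF];

    memset(data, job->fill, sizeof data);
    job->result = locked_pwrite(job->fd, data, sizeof data, job->offset);
    return NULL;
}

/* Two threads write the two halves of the first page of `path`.
 * Returns 0 if both writes succeeded, -1 otherwise. */
int write_page_from_two_threads(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct half_page_job jobs[2] = { { fd, 0, 0xAA, -1 }, { fd, HALF, 0xBB, -1 } };
    pthread_t threads[2];
    int started = 0;

    for (int i = 0; i < 2; i++) {
        if (pthread_create(&threads[i], NULL, write_half_page, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    int rc = (started == 2 && jobs[0].result == 0 && jobs[1].result == 0) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

If the zeroed pages disappear once every write goes through one lock like this, that points at concurrent access rather than at the card itself.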

3 Answers


I have had similar experiences with embedded applications that corrupt binaries. First of all, double-check your file handling (especially in multithreaded environments), though I imagine you have already done so thoroughly. Then try to sync all your writes. The Linux kernel does not write to disk at the moment your app issues a write; it buffers the data and flushes it to disk later.

http://linux.die.net/man/2/sync
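As a rough illustration of syncing writes from native code, the sketch below writes a buffer and then calls `fsync(2)`, the per-file counterpart of the `sync(2)` call linked above, before closing the file. Using `fsync` instead of `sync` is a substitution made for this example, and the helper names `write_all` and `save_durably` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes all `len` bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Writes `len` bytes to `path` and asks the kernel to flush them to the
 * storage device before returning. Returns 0 on success, -1 on error. */
int save_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, buf, len);
    if (rc == 0 && fsync(fd) != 0) /* flush buffered data for this file */
        rc = -1;
    if (close(fd) != 0)            /* close can also report write errors */
        rc = -1;
    return rc;
}
```

Checking the return values of `write`, `fsync`, and `close` matters here, because a failed flush is otherwise silent and may only show up later as a damaged file.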

Hope this helps.

Nikilator

Check your file handling; that's usually the problem, in my experience.

massex12

Broken files, or even a broken file system, can strangely be caused by '4096-byte files'.

This corruption is due to the cluster (block) size of the ext4 file system, which is equal to the page size.

At the moment, the default size of a block is 4KiB, which is a commonly supported page size on most MMU-capable hardware. This is fortunate, as ext4 code is not prepared to handle the case where the block size exceeds the page size.

PS

I am taking ext4 as the example because it is the default file system for Linux-based OSs (including, but not limited to, Android).

Now on to the reasons why a 4 KiB file could be dangerous. They are simple to understand:

  • Improper file handling: a wrong procedure when creating, reading, editing, or deleting files could damage them, and possibly break the whole file system along with them. These 'improper procedures' include non-human behavior and accidents. (PS: this is not limited to 4 KiB files.)
  • Improper low-level data handling: not a common case, but still possible. It happens when the kernel or the user tries to edit the file system at a low level. (You will need to investigate this case further yourself, as covering it would take a very long article!)
  • There are still many strange ways for data to get broken; I am trying to keep this brief. The other causes depend on many factors, so I have mentioned only the most common causes of this issue on an Android device.

You can read more here:

  • Ext4 Disk Layout: all that a researcher needs to know about ext4
  • exFAT file system: exFAT insight! Another commonly used file system (even on Android) that is known to be very vulnerable!
  • ...
El Don