5

I'm guessing that a typical filesystem tends to keep some kind of checksum/CRC/hash of every file it manages, so it can detect file corruption.

Is that guess correct? And if yes, is there a way to access it?

I'm primarily interested in Windows and NTFS, but comments on other platforms would be welcome as well... Language is unimportant at this point, but I'd like to avoid assembler if possible.

Nakilon
Branko Dimitrijevic
    No. CRC checking is the job of the disk drive. – Hans Passant Oct 18 '11 at 19:03
  • @HansPassant At the block level, sure. But what about file level? – Branko Dimitrijevic Oct 18 '11 at 19:06
  • depending on the OS and filesystem that can be true... for example for ZFS (available for Sun, Linux and OS X)... anyway IF that is calculated/stored by the filesystem it is usually not accessible via a documented API... to get at it you usually need to dig deep and use several undocumented things, which in some cases need specific permissions (Administrator, root or even a kernel module/driver)... that is usually much more trouble than just calculating your own checksum... what exactly is your goal? – Yahia Oct 18 '11 at 19:06
  • @Yahia Yup, that's what I was thinking, but I needed confirmation. The goal is performance: avoiding I/O on file content if the filesystem has already "accessed" that content and calculated a checksum. – Branko Dimitrijevic Oct 18 '11 at 19:14
  • @BrankoDimitrijevic, that performance hit is one good reason why file systems don't try to second-guess the hardware. – Mark Ransom Oct 18 '11 at 19:27
  • @MarkRansom I'm really not an expert on the subject, so forgive me if I'm completely on the wrong path here... I think there is a difference between block-level and (supposed) file-level checksums. All the blocks in the file may have correct checksums, yet the file as a whole may be corrupt if some block is misplaced (e.g. the data structure that holds the list of blocks did not update correctly due to a power failure). So while the filesystem may not necessarily scan the contents of blocks in software, I'm guessing it would still be useful to "aggregate" block-level checksums into a file-level checksum. – Branko Dimitrijevic Oct 18 '11 at 19:41
  • Think of the logistics - if you changed a single byte in the middle of a file, how would the file system recalculate the file checksum? At what point would the file system try to use the checksum to validate file integrity? – Mark Ransom Oct 18 '11 at 19:45
  • @MarkRansom It would subtract (from the file checksum) the old block checksum and add the new one. And during "check disk" it would use it to compare the filesystem data structures that "point" to the file with the file itself. I'm pulling this from my behind of course and might be on the wrong track completely... ;) (a sketch of this additive idea, and why it falls short, appears after these comments) – Branko Dimitrijevic Oct 18 '11 at 20:05
  • You're assuming a simple additive checksum. But that additive checksum would not detect blocks out of order, which contradicts your previous comment! At any rate, NTFS does not maintain per-file checksums. It uses journalling to ensure that it doesn't lose blocks. – Raymond Chen Oct 19 '11 at 13:12
  • Possible duplicate of [There is in Windows file systems a pre computed hash for each file?](http://stackoverflow.com/questions/1490384/there-is-in-windows-file-systems-a-pre-computed-hash-for-each-file) – user Nov 18 '15 at 01:57
  • See ZFS - it has checksums. – i486 Dec 11 '20 at 15:21
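To make the additive-checksum discussion above concrete, here is a minimal sketch of the idea and why it falls short, as Raymond Chen points out. The 4-byte block size and the use of CRC32 are arbitrary choices for the demo, not anything a real filesystem does:

```python
import zlib

BLOCK_SIZE = 4

def block_checksums(data: bytes) -> list[int]:
    """CRC32 of each fixed-size block of the data."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def additive_file_checksum(data: bytes) -> int:
    """'Aggregate' file checksum: just the sum of the block checksums."""
    return sum(block_checksums(data)) & 0xFFFFFFFF

original  = b"AAAABBBBCCCC"   # three blocks: AAAA, BBBB, CCCC
reordered = b"BBBBAAAACCCC"   # same blocks, first two swapped

# The additive checksum is identical even though the content differs...
print(additive_file_checksum(original) == additive_file_checksum(reordered))  # True
# ...while a checksum over the whole content catches the reordering.
print(zlib.crc32(original) == zlib.crc32(reordered))                          # False
```

Because addition is order-independent, any per-block aggregate built this way misses exactly the "misplaced block" corruption the question worries about.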

2 Answers

3

OK, it appears that what I'm asking for does not exist: NTFS does not maintain per-file checksums, so there is nothing to access.

BTW, this was also discussed here: In Windows file systems, is there a pre-computed hash for each file?
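Since the filesystem offers nothing, the fallback suggested in the comments is to compute your own file hash. A minimal sketch using only the standard library; the choice of SHA-256 and the 1 MiB chunk size are just reasonable defaults, not anything NTFS-specific:

```python
import hashlib

def file_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (path is illustrative):
# print(file_digest(r"C:\Windows\notepad.exe"))
```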

Glenn Slayden
Branko Dimitrijevic
1

Most filesystems, and the storage hardware beneath them, keep checksums of allocation units (blocks or clusters), not of whole files.

The hardware-level checksums are generally not accessible at all, and the per-cluster checksums a filesystem keeps would not be very useful for most purposes, and would be difficult to get at even where they exist.

Thymine
  • It's unfortunate that there isn't a hash of some sort, as Microsoft could then optimize by not replacing identical files (same timestamp and hash) – a sketch of that idea follows below. – PRMan May 15 '18 at 23:49
  • There are filesystems that achieve that, although not working exactly how you're describing. ZFS is the one I know the most about, but in general this strategy is called `copy-on-write` – Thymine Jun 15 '18 at 18:35
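A rough sketch of the optimization described in the comment above, under the assumption that you compute and compare the hashes yourself (the metadata shortcut, SHA-256, and the function names are all illustrative; no OS facility is involved):

```python
import hashlib
import os
import shutil

def _sha256(path: str, chunk_size: int = 1 << 20) -> bytes:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def copy_if_changed(src: str, dst: str) -> bool:
    """Copy src over dst only when dst is missing or its content differs."""
    if os.path.exists(dst):
        s, d = os.stat(src), os.stat(dst)
        same_meta = s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)
        if same_meta and _sha256(src) == _sha256(dst):
            return False                 # identical: skip the copy entirely
    shutil.copy2(src, dst)               # copy2 also preserves timestamps
    return True
```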