
I need to build a file store in which files are addressed by their SHA-1 hash. I know Git uses such an approach.

Do you know of an existing Java implementation of this approach? Or the correct name for this kind of storage, so that I can search for it on Google?

Aleks Ya
  • http://stackoverflow.com/questions/6293713/java-how-to-create-sha-1-for-a-file. See if this helps – prasanth Apr 02 '17 at 21:49
  • You still want to do that now that SHA-1 has practical collisions? Anyway, see https://en.wikipedia.org/wiki/Content-addressable_storage – Ry- Apr 02 '17 at 21:49
  • @Ryan I'm not sure what you're saying. Collisions themselves are unavoidable - **all** hash algorithms have collisions. Collisions are completely unavoidable when all possible sets of data in the entire universe get mapped into a fixed number of bits. It's being able to **create** collisions in a predictable manner that can be a problem if the hash is used to validate the data. – Andrew Henle Apr 02 '17 at 22:22
  • @AndrewHenle: *“It's being able to create collisions in a predictable manner that can be a problem”* Yes (“practical collisions”), which is what Google proved to be possible (and made possible for everyone in a limited fashion thanks to the nature of SHA-1) this year. – Ry- Apr 02 '17 at 23:13
  • Another perspective on the same problem: https://blog.qualys.com/ssllabs/2014/09/09/sha1-deprecation-what-you-need-to-know You should seriously consider sha-2 based algorithms. – Dave Apr 03 '17 at 00:01
  • @Ryan Anyone could win the lottery, but ... – ElpieKay Apr 03 '17 at 06:47
  • @ElpieKay: No, seriously, anyone can make colliding values based on Google’s pattern. Instantly. https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html – Ry- Apr 03 '17 at 06:53
  • @Ryan But no hash is guaranteed to be unique anyway. Any data indexing scheme that treats a hash as guaranteed to uniquely identify any one set of input data is already broken. Every hash value effectively has an infinite number of inputs that will generate the exact same hash value, so once a collision is detected by an indexing system, the data itself has to be examined if 100% reliability is required. Crafted collisions are problems when hashes are used to *validate* data as (probably) not being tampered with. – Andrew Henle Apr 03 '17 at 09:36
  • @AndrewHenle: *“Any data indexing scheme that treats a hash as guaranteed to uniquely identify any one set of input data is already broken.”* Um, no. Do you know how big 2^128 is? The facts are these: the SHA-1 vulnerability is extremely relevant for content-addressable storage, and switching to SHA-2 or -3 would work fine, because you’re not going to store an infinite number of files. – Ry- Apr 03 '17 at 10:35
  • @Ryan *Do you know how big 2^128 is?* Do ***you*** know how big all the possible combinations of data in the entire universe are? Map ***all*** that into 128 bits and there are an infinite number of collisions. If you want 100% reliability, you need to account for the fact that any hash can not be guaranteed to be unique. If my standards are too high, that's not my problem. – Andrew Henle Apr 03 '17 at 11:26
  • @AndrewHenle: It doesn’t matter that there are infinite strings of data because you’re not going to store infinite data. For a 256-bit hash, you can store 2^128 distinct strings, taking up at the very minimum 4874388904686184832415301632 TB of space, and still only have a *less than* 50% chance of having *one* collision. – Ry- Apr 03 '17 at 11:35
  • @AndrewHenle: In other words: I’m thinking of a number between 1 and 1000. If you guess it right and then win the 6/49 jackpot twice in a row, we can store 480000000000000000000000000000 files in a SHA-256-based CAFS and see if your luck holds. – Ry- Apr 03 '17 at 11:52
  • @Ryan I see that your goal isn't 100% reliability. Nice to know. You're also assuming the hash is a random value. It really isn't, else there'd be no way to create collisions. So your math is also wrong. And you're ignoring the fundamental problem: if SHA-1 can be broken, so can SHA-256. – Andrew Henle Apr 03 '17 at 11:56
  • @AndrewHenle: Indeed, you shouldn’t use a hash function with a low-complexity means of creating collisions – hence my warning against SHA-1. What was your point again? – Ry- Apr 03 '17 at 12:01
  • @AndrewHenle: *“if SHA-1 can be broken, so can SHA-256”* No, that doesn’t follow. – Ry- Apr 03 '17 at 12:03
  • @Ryan 1. You can successfully track the very two PDFs created by the collision team in the same Git repository, because their blob SHA-1 values are different. 2. You can see how much it cost them to create such a collision. 3. Have you considered why they used PDF files instead of more common text files? 4. Git is so far safe enough even though it uses SHA-1. – ElpieKay Apr 03 '17 at 15:17
  • @ElpieKay: Ah, sorry, I guess that’s fine then. Please feel free to use SHA-1 instead of SHA-256 out of some sense of “I told you so” (or those yummy two-keystroke savings) because “it costs money to create a collision if I happen to SHA-1 something that is not the object itself”. – Ry- Apr 03 '17 at 17:32
  • Everybody, thanks for your comments! @Ryan your link to "Content-addressable storage" answers my question. If you want, post it as an answer and I'll mark it as the best. – Aleks Ya Apr 03 '17 at 23:37

1 Answer


This approach is called content-addressable storage (CAS).
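For illustration, here is a minimal sketch of the idea in plain Java, using only the standard library. It is not an existing library and not how Git itself is implemented: the class name `ContentStore`, its `put`/`get` methods, and the two-character directory fan-out are all assumptions made for this example. The digest algorithm is a constructor parameter, so `"SHA-1"` works as asked in the question, though given the collision concerns raised in the comments `"SHA-256"` is probably the safer choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical minimal content-addressable file store (illustrative only). */
final class ContentStore {
    private final Path root;
    private final String algorithm; // e.g. "SHA-256" or "SHA-1"

    ContentStore(Path root, String algorithm) {
        this.root = root;
        this.algorithm = algorithm;
    }

    /** Stores the stream's bytes and returns their hex digest, which serves as the key. */
    String put(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        Files.createDirectories(root);
        // Write to a temp file first, because the key is only known after hashing.
        Path tmp = Files.createTempFile(root, "incoming", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
                out.write(buf, 0, n);
            }
        }
        String key = toHex(md.digest());
        Path target = pathFor(key);
        Files.createDirectories(target.getParent());
        if (Files.exists(target)) {
            // Same digest already stored: treat as a duplicate and discard the copy.
            Files.delete(tmp);
        } else {
            // Not safe against concurrent writers of the same key; fine for a sketch.
            Files.move(tmp, target);
        }
        return key;
    }

    /** Reads back the content stored under the given hex digest. */
    byte[] get(String key) throws IOException {
        return Files.readAllBytes(pathFor(key));
    }

    /** Assumed layout: first two hex characters as a subdirectory, the rest as the file name. */
    private Path pathFor(String key) {
        return root.resolve(key.substring(0, 2)).resolve(key.substring(2));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Storing the same bytes twice returns the same key and keeps a single copy on disk, which is the defining property of this kind of storage. A store that must be robust against crafted collisions could additionally compare the bytes when the target already exists.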

Aleks Ya