14

I am trying to create a [single] md5 hash of multiple strings [in Java]. That is I want

md5(string1, string2, string3, ..., stringN)

Currently I am trying to concatenate all strings with some rarely used separator like #. That is

md5(string1#string2#...#stringN)

This looks hacky and I am worried about some weird string actually having the separator as part of it. What is the best way to do it?

Fakrudeen
  • 5,778
  • 7
  • 44
  • 70

11 Answers11

12

This could possibly be better:

md5(md5(string1) + md5(string2) + ... + md5(stringN))

It'd eliminate the separator problem, but it's hard to say how good it is.

Sebastian Paaske Tørholm
  • 49,493
  • 11
  • 100
  • 118
  • This is fine, cryptographically speaking. – Nick Johnson Jan 24 '11 at 23:35
  • Indeed, otherwise a,b,c would be the same as c,b,a. – Sebastian Paaske Tørholm Jan 25 '11 at 01:23
  • @Nick - Any links - That it doesn't reduce the entropy by much? I have an [unsubstantiated but quite certain] feeling that it might have less entropy. For one it is now less likely to result in values at the lower end of hash. We have less likelihood of final hash values of 0, 1, 2 etc. as for 0 we need all of them to produce 0 etc. It is no longer uniform on md5 range. – Fakrudeen Jan 25 '11 at 09:45
  • @Fakrudeen Secure hash algorithms, like MD5, are designed such that you can learn no information about the inputs from the output. If it were possible to introduce bias into the resulting hash simply by using it in a particular way, that would be a fatal weakness. I'm not sure what you're trying to say about hash values - what you describe is not how a secure hash function works. – Nick Johnson Jan 25 '11 at 22:47
  • 1
    I realized now what Sebastian is saying from GregS question. Since md5() would produce 128 bit number I thought you are summing up all md5 numbers rather than concatenation. Now this solution does look pretty good! – Fakrudeen Nov 07 '13 at 09:24
  • This is unnecessarily expensive and it's not immediately obvious whether it's sound (cryptographically). The only remaining upside is that it's clever. – Harold R. Eason Nov 13 '13 at 20:38
  • Nice solution but i would advice use of NULL character as separator as it never appears in string data and also avoid use of multiple MD5 as in this case. This is a slower approach – Vikram Bhat Nov 15 '13 at 08:25
  • @VikramBhat: That will only work if no input strings have null in them, but yes, it's a solution, just not very general. – Harold R. Eason Nov 15 '13 at 14:43
  • @Harold R. Eason NULL characters are called NULL because they no use in string other than termination. So no possible valid string can have NULL in it, You surely can always remove NULL characters from input string. – Vikram Bhat Nov 15 '13 at 15:45
4

It does not really matter if the separator is part of the string. You probably dont even need a separator, since you're not going to be decomposing the concatenated string into parts

krosenvold
  • 75,535
  • 32
  • 152
  • 208
  • 4
    What if the strings are "ab", "c" and "a" "bc" - they end up mapping to same hashcode if I don't use a separator. What I am worried about with separator is "ab#", "c" and "ab", "#c" which will have same hash code. [Note that I am using md5 as pretty much equality as collison is very rare - I am using it as key of Hadoop reducer and I don't want to do an equality comparison again, if I can help it] – Fakrudeen Jan 24 '11 at 18:01
  • Is this really occurring often? If yes, then indeed use a separator that shouldn't appear in the string (if you know their contents). If no, well... just don't bother! – Olivier Grégoire Jan 24 '11 at 18:12
  • Not a collision resistant hash function at all – Vikram Bhat Nov 15 '13 at 08:27
3

I've run into a similar problem before, and the best solution I could come up with was to use a non-typeable ascii character as the separator. Look at "man ascii" and pick one. My favorite is '\a', which is the ASCII symbol for the "bell" sound.

Cory Klein
  • 51,188
  • 43
  • 183
  • 243
3

If you want to be sure that there can't be collision from just moving text from one string to the next, I recommend this scheme:

md5(<len1>+str1+<len2>+str2...)

Here, len1 is a fixed-length representation of the length of str1. For md5, it would be most appropriate to use a four-byte int value (assuming you know that you wont have strings longer than 2**31). Alternatively, use "decimal length#", i.e. (in Python notation)

md5(str(len(str1))+"#"+str(len(str2))+"#"+str2+...)

This can't produce collisions by just moving text from one string to the other, since the lengths would change.

Martin v. Löwis
  • 124,830
  • 17
  • 198
  • 235
  • This looks the safest from collision perspective. However I am worried about whether it will reduce the effective entropy [doesn't cover whole md5 range] because some characters in the string are always int. – Fakrudeen Nov 07 '13 at 09:29
  • If binary sequence of some other combination of +str1++str2+ would match with current combination hence we will have collisions – Vikram Bhat Nov 15 '13 at 08:22
2

for strings, I think +Coly Klein solution of adding non type-able characters is the best.

If you want a solution that works for binary data too, or you don't sure the string don't contain those characters, you can use recursive hash, e.g:

md5(md5(str1)+md5(str2)+md5(str3)+...+)

depends on the amount of data this solution can demand a lot of resources (not far ago i profiled a program and found that 97% of it's time it calculates sha1's, so I must warn you..)

Ohad Cohen
  • 5,756
  • 3
  • 39
  • 36
2

Putting all the answers together, here is a class with a single public and static method which solves the posed question in an efficient manner. Feel free to comment, critique, or use this code as you wish (public domain and all)...

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * MD5Summer is a utility class that abstracts the complexity of computing
 * the MD5 sum of an array of Strings.
 * <p>
 * Submitted as an answer to the StackOverflow bounty question:
 * <a href="http://stackoverflow.com/questions/4785275/compute-md5-hash-of-multi-part-data-multiple-strings">
 * compute md5 hash of multi part data (multiple strings)</a>
 * <p>
 * This solution uses the 'fast' "byte[] to hex string" mechanism described here in
 * <a href="http://stackoverflow.com/questions/9655181/convert-from-byte-array-to-hex-string-in-java">
 * Convert from byte array to hex string in java</a>.
 * <p>
 * The MD5 sum is always calculated by converting the inputStrings to bytes based on
 * the UTF-8 representation of those Strings. Different platforms using this class
 * will thus always calculate the same MD5sum for the same Java Strings.
 * <p>
 * Using a ThreadLocal for storing the MessageDigest instance significantly reduces the amount of time spent
 * obtaining a Digest instance from the java.security subsystem.
 * <p>
 * <i>Copyright - This code is released in to the public domain</i>
 */
public final class MD5Summer {

    /**
     * Calculate the MD5 sum on the input Strings.
     * <p>
     * The MD5 sum is calculated as if the input values were concatenated
     * together. The sum is returned as a String value containing the
     * hexadecimal representation of the MD5 sum.
     * <p>
     * The MD5 sum is always calculated by converting the inputStrings to bytes based on
     * the UTF-8 representation of those Strings. Different platforms using this class
     * will thus always calculate the same MD5sum for the same Java Strings.
     * 
     * @param values The string values to calculate the MD5 sum on.
     * @return the calculated MD5 sum as a String of hexadecimal.
     * @throws IllegalStateException in the highly unlikely event that the MD5 digest is not installed.
     * @throws NullPointerException if the input, or any of the input values is null.
     */
    public static final String digest(final String ...values) {
        return LOCAL_MD5.get().calculateMD5(values);
    }

    /**
     * A Thread-Local instance of the MD5Digest saves construct time significantly,
     * while avoiding the need for any synchronization.
     */
    private static final ThreadLocal<MD5Summer> LOCAL_MD5 = new ThreadLocal<MD5Summer>() {
        @Override
        protected MD5Summer initialValue() {
            return new MD5Summer();
        }   
    };

    private static final char[] HEXCHARS = "0123456789abcdef".toCharArray();
    private static final Charset UTF8 = Charset.forName("UTF-8");


    private final MessageDigest md5digest;

    /**
     * Private constructor - cannot create instances of this class from outside
     */
    private MD5Summer () {
        // private constructor making only thread-local instances possible.
        try {
            md5digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 should always be available.
            throw new IllegalStateException("Unable to get MD5 MessageDigest instance.", e);
        }
    }

    /**
     * Private implementation on the Thread-local instance.
     * @param values The string values to calculate the MD5 sum on.
     * @return the calculated MD5 sum as a String of hexadecimal bytes.
     */
    private String calculateMD5(final String ... values) {
        try {
            for (final String val : values) {
                md5digest.update(val.getBytes(UTF8));
            }
            final byte[] digest = md5digest.digest();
            final char[] chars = new char[digest.length * 2];
            int c = 0;
            for (final byte b : digest) {
                chars[c++] = HEXCHARS[(b >>> 4) & 0x0f];
                chars[c++] = HEXCHARS[(b      ) & 0x0f];
            }
            return new String(chars);
        } finally {
            md5digest.reset();
        }
    }

}

I have compared the results of this program against Linux's md5sum program with the following little test:

public class MD5Tester {
//    [rolf@rolfl ~/md5data]$ echo "Frodo Baggins" >> frodo
//    [rolf@rolfl ~/md5data]$ echo "Bilbo Baggins" >> bilbo
//    [rolf@rolfl ~/md5data]$ cat frodo bilbo 
//    Frodo Baggins
//    Bilbo Baggins
//    [rolf@rolfl ~/md5data]$ cat frodo bilbo | md5sum 
//    a8a25988435405b9a62634c887287b40 *-
//    [rolf@rolfl ~/md5data]$ 


    public static void main(String[] args) {
        String[] data = {"Frodo Baggins\n", "Bilbo Baggins\n"};
        String md5data = MD5Summer.digest(data);
        System.out.println("Expect a8a25988435405b9a62634c887287b40");
        System.out.println("Got    " + md5data);
        if (!"a8a25988435405b9a62634c887287b40".equals(md5data)) {
            System.out.println("Data does not match!!!!");
        }
    }
}
rolfl
  • 17,539
  • 7
  • 42
  • 76
2

you can use base64 to encode your String before you append your string. then in md5 function split your string and decode it. for example

public class MutilMd5 {

public static void main(String[] args) throws Base64DecodingException {
    String s1 = "12#3";
    String s2 = "#12345";

    multMd5(Base64.encode(s1.getBytes()) + "#" + Base64.encode(s2.getBytes()));

}

public static void multMd5(String value) throws Base64DecodingException {
    String md5 = "";
    String[] encodeStrings = value.split("#");
    if (encodeStrings != null) {
        for (String encodeString : encodeStrings) {
            System.out.println(new String(Base64.decode(encodeString.getBytes())));
            md5 = md5 + DigestUtils.md5Hex(encodeString);
        }
    }

    System.out.println(md5);
}

}

output is

12#3

12345

13094636ff02b51be53c496d04d39bc2375704c2e00da07d2c9acc7646b2a844

Community
  • 1
  • 1
chunxiao
  • 58
  • 4
2

Just don't separate them. It's a hash method: there is no use in separating them...

MessageDigest md5 = MessageDigest.getInstance("MD5");
byte[] bytes = ...;
for (String toHash: stringsToHash) {
  md5.update(toHash.getBytes("UTF-8"));
}
md5.digest(bytes);
Olivier Grégoire
  • 33,839
  • 23
  • 96
  • 137
1

You can use MessageDigest class to generate your code. If I were you, if at a preprocess stage I know the length of each String, I would pass them as a unique String. If the Strings you're passing have different random lengths that can't be known I would hash them one by one, but I need to know if the origin entity and the receipt entity are well syncronized to know the length of the messages they are passing between each other.

private static final char[] HEXADECIMAL = { '0', '1', '2', '3',
    '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public  String hash(String stringToHash)  {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] bytes = md.digest(stringToHash.getBytes());
        StringBuilder sb = new StringBuilder(2 * bytes.length);
        for (int i = 0; i < bytes.length; i++) {
            int low = (int)(bytes[i] & 0x0f);
            int high = (int)((bytes[i] & 0xf0) >> 4);
            sb.append(HEXADECIMAL[high]);
            sb.append(HEXADECIMAL[low]);
        }
        return sb.toString();
    } catch (NoSuchAlgorithmException e) {
        //exception handling goes here
        return null;
    }
}
Alex
  • 975
  • 10
  • 24
1

From my understanding, you want to hash lists of strings ensuring that no two different lists give the same result. This can be solved without thinking of the hash function at all.

You need a function String f(List<String> l) where no two input values result in the same output (an injective function from List<String> to String). With this, you can give the output to your hash function, and be ensured that there will be no collisions as far as the hash function itself ensures (note MD5 was broken years ago, so it may not be an appropriate choice). Here are 2 ways to implement f:

Transforming into a susbset of the character set

The most straightforward way is to just map every input to a subset of the character set of String that does not include your separator character:

public static String hex(String s) {
    try {
        String o = "";
        for(byte b: s.getBytes("utf-8"))
        o += String.format("%02x", b&0xff);
        return o;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

public static String f(String... l) {
    if (l.length == 0) return "";
    String o = hex(l[0]);
    if (l.length == 1) return o;
    for (int i = 1; i < l.length; i++) o += "#" + hex(l[i]);
    return o;
}
f("a#","b") => 6123#62
f("a","#b") => 61#2362

Length prefixing

This is also pretty simple, but has the disadvantage that it can't be rewritten to work in a stream.

public static String prefix(String s) {
    return s.length() + "." + s;
}

public static String f(String... l) {
    if (l.length == 0) return "";
    String o = prefix(l[0]);
    if (l.length == 1) return o;
    for (int i = 1; i < l.length; i++) o += "#" + prefix(l[i]);
    return o;
}
f("a#","b") => 2.a##1.b
f("a","#b") => 1.a#2.#b
Harold R. Eason
  • 563
  • 2
  • 9
  • @Fakrudeen: Yes, it's the same as his "md5(+str1++str2...)" except it has dots in case the input begins with a number, which may not be necessary, but it's easier to see that it works this way. I also was going to suggest escaping occurrences of `#` (by changing them to `\#`) - then you wouldn't need to encode the entire string as in my first solution, but it's nontrivial to get this right and it introduces subtleties in a naive implementation (e.g., `f("a\\","#b") => a\#\#b` and `f("a##b") => a\#\#b`). – Harold R. Eason Nov 15 '13 at 14:40
1

As You are using string as input and We know strings do not have NULL character so may use NULL character as separator for all Strings. You check input for NULL character while authentication.

md5(String1+NULL+String2+NULL+String3....)

Saves you time of multiple Md5.

Vikram Bhat
  • 6,106
  • 3
  • 20
  • 19