3

I've got a massive file of hex encoded MD5 values that I'm using linux 'sort' utility to sort. The result is that the hashes come out in sequential order (which is what I need for the next stage of processing). E.g:

000001C35AE83CEFE245D255FFC4CE11 
000003E4B110FE637E0B4172B386ACAC 
000004AAD0EB3D896B654A960B0111FA

In the interest of speeding up the sort operation (and making the files smaller), I was considering encoding the data as base32 or base64.

The question is, would an alpha-sort of the base32/64 data get me the same result? My quick tests seem to indicate that it would work. For example, the above three hex strings correspond 1:1 to these base64 strings:

AAABw1roPO/iRdJV/8TOEQ==
AAAD5LEQ/mN+C0Fys4asrA==
AAAEqtDrPYlrZUqWCwER+g==

But I'm unsure as to the sort order when it comes to special characters used in Base64 like "/" and "+" and how those would be treated in the context of an alpha sort.

Note: I happen to be using the linux sort utility but the question still applies to other alpha-sorting tools. The tool used is not really part of the question.

Mark Thomson
  • 739
  • 5
  • 12

1 Answers1

7

I've since discovered that this isn't possible with the standard base32/64 implementations. There exists however a base32 variation called "base32hex" which preserves sort ordering, but there is no official "base64hex" equivalent.

Looks like that leaves creating a custom encoding like this.

EDIT: This turned out to be very trivial to solve. Simply encode in base 64 then translate character to character with a custom table of characters that respects sort order.

Simply map from the standard Mime 64 characters:

  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

To something like this:

  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"

Then sorting will work.

Community
  • 1
  • 1
Mark Thomson
  • 739
  • 5
  • 12