
Short version: How can I turn an arbitrary string into a 6-digit number with minimal collisions?

Long version:

I'm working with a small library that has a bunch of books with no ISBNs. These are usually older, out-of-print titles from tiny publishers that never got an ISBN to begin with, and I'd like to generate fake ISBNs for them to help with barcode scanning and loans.

Technically, real ISBNs are controlled by commercial entities, but it is possible to use the format to assign numbers that belong to no real publisher (and so shouldn't cause any collisions).

The format is such that:

978-0-01-######-?

Gives you 6 digits to work with, from 000000 to 999999, with the ? at the end being a checksum.

Would it be possible to turn an arbitrary book title into a 6-digit number in this scheme with minimal chance of collisions?

i-know-nothing
  • Why not just assign them sequentially? – Damien_The_Unbeliever Oct 11 '12 at 08:03
  • I think you'd need around ~1000 books to have a 50% chance of collision? – user541686 Oct 11 '12 at 08:24
  • @Damien_The_Unbeliever: Can't, because the list of titles is always changing and the software we're using (in this case, Readerware, a consumer library management program) doesn't allow us to manipulate the database at that level and keep track of unique IDs. Tying the ID to a book title is the only way I can think of to make the ID survive a CSV export/import into the Readerware internal database. – i-know-nothing Oct 11 '12 at 09:03
  • @Mehrdad: How is something like that calculated, anyway? – i-know-nothing Oct 11 '12 at 09:03
  • @asdf: An easy way to approximate it is to just `int max = 1000000; double prob = 1; int i; for (i = 0; i < max && prob > 0.5; i++) { prob *= (max - i) / (double)max; } print(i);`. The idea is that the probability of collision is one minus the probability of *not* colliding, which (after picking the first #) is 999,999/1,000,000 * 999,998/1,000,000 * 999,997/1,000,000 ... and you want to know when that reaches 50%. If you want to learn more about it look here: http://en.wikipedia.org/wiki/Birthday_problem (This is one example why learning discrete math is important for computer science, BTW.) – user541686 Oct 11 '12 at 09:35
  • Thanks for that explanation, @Mehrdad. According to that math it won't be an issue *for now* and hopefully not for the foreseeable future... by the time we get enough books without ISBNs, hopefully we can set up a better sequential-ID system. Or better yet maybe an in-house call number system. – i-know-nothing Oct 11 '12 at 10:24
  • @Mehrdad: Because of what you explained, I'm going to modify this scheme a bit and assign GTINs to books without ISBNs. ISBNs are a subset of the greater GTIN scheme; GTIN allows a great range of "internal use only" prefixes. Long story short this gives me 10 digits to work with instead of 6, which should result in far fewer collisions. Thanks again. – i-know-nothing Oct 11 '12 at 11:00
  • @asdf: Yup, great idea! Glad it helped. – user541686 Oct 11 '12 at 11:15
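
For reference, here is the collision estimate from @Mehrdad's comment as a small, self-contained C# program (C# chosen to match the answer below). The printed figure agrees with the standard birthday approximation 1.18·√N, which for N = 1,000,000 gives about 1178:

    using System;

    class CollisionEstimate
    {
        static void Main()
        {
            const int max = 1000000; // size of the 6-digit ID space
            double prob = 1.0;       // probability of NO collision so far
            int i;
            for (i = 0; i < max && prob > 0.5; i++)
            {
                // each new random ID must miss all i IDs already taken
                prob *= (max - i) / (double)max;
            }
            Console.WriteLine(i);    // ~1178 books for a 50% collision chance
        }
    }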

2 Answers


After combining code snippets for a fixed-length string hash and the ISBN-13 checksum calculation, I ended up with some really ugly C# that seems to work. It takes an arbitrary string and converts it into a valid (but fake) ISBN-13:

    // 1,000,000 possible values -> exactly six digits (000000 to 999999)
    private const int MUST_BE_LESS_THAN = 1000000;

    public int GetStableHash(string s)
    {
        uint hash = 0;
        // if you care, this can be done much faster with unsafe
        // code, using a fixed char* reinterpreted as a byte*
        foreach (byte b in System.Text.Encoding.Unicode.GetBytes(s))
        {
            hash += b;
            hash += (hash << 10);
            hash ^= (hash >> 6);
        }
        // final avalanche
        hash += (hash << 3);
        hash ^= (hash >> 11);
        hash += (hash << 15);
        // we only want a positive integer < MUST_BE_LESS_THAN,
        // so a simple truncating cast is OK if not perfect
        return (int)(hash % MUST_BE_LESS_THAN);
    }

    public int CalculateChecksumDigit(ulong n)
    {
        string sTemp = n.ToString();
        int iSum = 0;

        // Walk the digits from right to left. This appears to be
        // backwards, but the EAN-13 checksum must be calculated
        // this way to be compatible with UPC-A.
        for (int i = sTemp.Length; i >= 1; i--)
        {
            int iDigit = Convert.ToInt32(sTemp.Substring(i - 1, 1));
            if (i % 2 == 0)
            {   // even position (counting from the left): weight 3
                iSum += iDigit * 3;
            }
            else
            {   // odd position: weight 1
                iSum += iDigit;
            }
        }
        return (10 - (iSum % 10)) % 10;
    }

    private void generateISBN()
    {
        string titlehash = GetStableHash(BookTitle.Text).ToString("D6");
        string fakeisbn = "978001" + titlehash;
        string check = CalculateChecksumDigit(Convert.ToUInt64(fakeisbn)).ToString();

        SixDigitID.Text = fakeisbn + check;
    }
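
Two notes on the code above. The hashing loop appears to be Bob Jenkins' "one-at-a-time" hash, which avalanches well even on short strings, so the low six digits come out reasonably uniform. And as a worked example of the checksum step (the sample number is mine, for illustration): for the 12-digit body 978001123456, the odd positions from the left contribute 9+8+0+1+3+5 = 26 and the even positions contribute (7+0+1+2+4+6) × 3 = 60, so the check digit is (10 − (86 mod 10)) mod 10 = 4, giving the full fake ISBN 9780011234564.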
i-know-nothing

The 6 digits allow for 1,000,000 possible values (000000 through 999999), which should be enough for most internal uses. I would have used a sequence instead in this case, because a 6-digit hash still has a relatively high chance of collisions.

Instead, you can insert all the titles into a table and use their index numbers as the ISBNs, with or without sorting first.
This makes collisions practically impossible, but it requires keeping a list of already-allocated ISBNs so numbers are never reused, along with the titles they were assigned to; that is information you would most probably want to keep anyway.
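
As a sketch, here is one way that allocation could look, assuming a simple append-only CSV file on disk (the file name, layout, and class are illustrative, not part of any existing tool):

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class IsbnAllocator
    {
        // Hypothetical append-only map file, one "<number>,<normalized title>"
        // entry per line; assumes normalized titles contain no commas.
        const string MapFile = "isbn-map.csv";

        // Returns the 6-digit block for a title, assigning the next free
        // sequential number if the title has never been seen before.
        public static string GetOrAssign(string normalizedTitle)
        {
            var map = new Dictionary<string, int>();
            if (File.Exists(MapFile))
            {
                foreach (string line in File.ReadAllLines(MapFile))
                {
                    string[] parts = line.Split(new[] { ',' }, 2);
                    map[parts[1]] = int.Parse(parts[0]);
                }
            }

            int id;
            if (!map.TryGetValue(normalizedTitle, out id))
            {
                // entries are never removed, so Count is the next free number
                id = map.Count;
                File.AppendAllText(MapFile, id + "," + normalizedTitle + Environment.NewLine);
            }
            return id.ToString("D6");
        }
    }

Because entries are only ever appended, a lost-and-reacquired book gets the same number back, which addresses the re-import concern raised in the comments below.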

Another option is to break the ISBN standard and use hexadecimal or uuencoded barcodes; that increases the possible range to a point where a cryptographic hash, truncated to fit, may work.
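
If your barcode symbology accepts letters (Code 128 encodes full ASCII, for instance), the truncated-hash idea could look like this sketch; the key string and truncation length are assumptions:

    using System;
    using System.Security.Cryptography;
    using System.Text;

    class HexBarcode
    {
        // SHA-1 of the normalized key, rendered as hex and truncated.
        // 12 hex characters carry 48 bits, far more than 6 decimal digits.
        static string HexId(string key, int digits)
        {
            using (var sha1 = SHA1.Create())
            {
                byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(key));
                var sb = new StringBuilder();
                foreach (byte b in hash)
                {
                    sb.Append(b.ToString("x2"));
                }
                return sb.ToString().Substring(0, digits);
            }
        }

        static void Main()
        {
            Console.WriteLine(HexId("some title some author", 12));
        }
    }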

Since you are handling old book titles, which may have several editions capitalized and punctuated differently, I would strip punctuation, collapse duplicated whitespace, and convert everything to lowercase before comparing, to minimize the chance of a technical duplicate even though the strings differ. (Unless you want different editions to have different ISBNs, in which case you can ignore this paragraph.)
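
A minimal sketch of that normalization, assuming the author name is folded in as the comments below suggest (the class and method names are mine):

    using System.Text;

    static class TitleKey
    {
        // Lowercases, strips punctuation, and collapses whitespace runs so
        // trivially different editions of a book produce the same key.
        public static string Normalize(string title, string author)
        {
            var sb = new StringBuilder();
            bool lastWasSpace = true; // also swallows leading whitespace
            foreach (char c in (title + " " + author).ToLowerInvariant())
            {
                if (char.IsLetterOrDigit(c))
                {
                    sb.Append(c);
                    lastWasSpace = false;
                }
                else if (char.IsWhiteSpace(c) && !lastWasSpace)
                {
                    sb.Append(' ');
                    lastWasSpace = true;
                }
                // all other characters (punctuation, symbols) are dropped
            }
            return sb.ToString().TrimEnd();
        }
    }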

Didi Kohen
  • I don't think sequential numbers would work, because in this workflow I don't see any way to keep track of numbers that have already been used. This is how it works right now: 1. Export list of books from our library software (Readerware) into CSV 2. Run this code, which assigns ISBNs to books that don't have one in the CSV 3. Reimport CSV into Readerware, replacing missing ISBNs with fake ones (continued) – i-know-nothing Oct 11 '12 at 09:36
  • 1
    But later, when we repeat this process with new titles, we have to be able to do this again. And if we lose a copy of a book, gain one later and re-add it to the database, it shouldn't have a separate ISBN from the first copy. And this whole process has to survive different computers running both the library program and the utility, which would mean that the sequential IDs would have to be saved somewhere online and I'd have to worry about sync issues and all that... just seems like a nightmarish scenario versus tying it to the book titles, which we can assume to be unique and stable. – i-know-nothing Oct 11 '12 at 09:37
  • 1
    I agree that keeping a sequential numbers list is problematic to work with, This would require having an append only list (or database) with titles and their ISBN that would not remove titles if they are lost, so when they are found or reacquired, they will be able to use the same ISBN. Making the hashing algorithm unique for this range would be much more complex than to keep a file of already assigned ISBNs and the book names. – Didi Kohen Oct 11 '12 at 10:35
  • So after some research, I'm going to modify this scheme a bit and assign GTINs to books without ISBNs. ISBNs are a subset of the greater GTIN scheme; GTIN allows a great range of "internal use only" prefixes. Long story short this gives me 10 digits to work with instead of 6, which should result in far fewer collisions. – i-know-nothing Oct 11 '12 at 10:59
  • 1
    Are you doing this with book title only or are you adding author name as well? A chance of same titles by different authors is pretty common AFAIK. – Didi Kohen Oct 11 '12 at 11:45
  • Great call. I'll throw the author in there and strip the punctuation, etc. before the hashing. – i-know-nothing Oct 11 '12 at 12:16
  • 1
    For anyone more interested in book record deduplication overall, there is a plugin for Calibre that does it via one of several algorithms, including even a phonetic one. It's overkill for this project, for now, but could be useful for bigger libraries. http://blog.calibre-ebook.com/2011/11/calibre-plugins-duplicate-finder.html – i-know-nothing Oct 11 '12 at 12:31