1

For a project I'm working on, I need a way to calculate a unique integer representation of a data structure with type [(int, int)], i. e. a collection of (nonnegative) integer pairs. The requirement is that while the ordering within the pair matters, the collection itself is order-insensitive. After some searching, I've come believe that a suitable solution would be to encode each pair using the Cantor pairing function and xor the results.

The range will be fairly small, say 1-700 for the first integer in the pair, 1-10 for the second one and the list will contain around 5-15 of these pairs.

If you believe there's a better solution, let me know, but an answer "yes, this will work" would be terrific as well :)

Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
Jakub Lédl
  • 1,805
  • 3
  • 17
  • 26
  • "Hash function" and "unique representation" are two phrases that don't normally sit well together. Which do you want? What are the requirements? For example, would simply concatenating the two integers be acceptable? – Oliver Charlesworth Jan 13 '13 at 18:40
  • 1
    A unique integer. Would "hashcode" be the correct term? – Jakub Lédl Jan 13 '13 at 18:43
  • 1
    In general, no. Hash functions usually reduce the amount of information, and so collisions will occur (i.e. multiple inputs will have the same output). You can avoid this with a so-called *perfect hash*, but that is fiddly, and requires knowledge of your entire data-set ahead of time. – Oliver Charlesworth Jan 13 '13 at 18:44
  • 1
    Well, I've never had a chance to get properly acquainted with terminology. I will elaborate upon the requirements. – Jakub Lédl Jan 13 '13 at 18:47
  • 1
    `The requirement is that while the ordering within the pair matters, the collection itself is order-insensitive.` I don't understand this. Does it mean that (a,b) and (b,a) should be considered different and *could* both be present? – wildplasser Jan 13 '13 at 18:52
  • @wildplasser Yes. `(a, b) != (b, a)`, but `[(a, b), (a, c)] == [(a, c), (a, b)]`. – Jakub Lédl Jan 13 '13 at 18:54
  • Ah, it is a set of ordered pairs. First approach would be to use H(a,b) := a*p op b*q, with {p,q} := relatively prime, and op := {+, Xor}. BTW : if the domain for your numbers is very small you could use multiplied primes, or even zobrist tables. – wildplasser Jan 13 '13 at 18:57
  • 1
    @SergeyS: Is that relevant? – Oliver Charlesworth Jan 13 '13 at 19:00
  • @Oli Charlesworth - more or less, but yes – SergeyS Jan 13 '13 at 19:01
  • @JakubLédl What really is the purpose of uniquely representing the data structure? Some simple hashing function (mostly built into framework) would do if collision is ok. What u r askin here is impossible. U can uniquely represent 2 integers ([See this question](http://stackoverflow.com/questions/919612)) with great effort. Now asking a number of integers (ie a collection) to be represented uniquely with another number is impossible (as long as you want the integer to be handled by normal computers. Otherwise get ready to to have an integer in the range 1E100s..) – nawfal Jan 13 '13 at 19:06
  • Do you need to search for singletons, or do the searches only need to check the existence of pairs? – wildplasser Jan 13 '13 at 19:07

1 Answers1

2

[This answer assumes that when you say "unique", you really mean that; collisions are unacceptable.]

If the aim is to somehow uniquely map an arbitrary-sized collection of integer (pairs) onto a single (reasonably-sized) integer, then the answer is essentially "that's not possible". This can easily be demonstrated by appealing to the pigeonhole principle.

If the size of your collections is extremely limited, and the range of the input integers is extremely limited, then you may be able to do something. But in the general case, I'd suggest looking for a different solution to whatever your top-level problem is.


Update

As a worked example, let's consider the parameters you added to your question. 700 * 10 = 7000, so you need roughly 13 bits to uniquely represent each possible pair. With a maximum of 15 pairs, that's 195 bits needed in total.

Now, if order doesn't matter, then you can theoretically drop log2(15!) = 40 bits.* So in theory, you need an output datatype with 155 bits. Is that tractable?


* How one achieves this in practice is another question... ;)
Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
  • Well, I know the integers will always be nonnegative and fairly small, so Cantor should work. I'm not sure about the idea of xoring the results to obtain a representation of the whole list (I remember reading *somewhere* that this would work for an unordered set, but a confirmation would be nice). – Jakub Lédl Jan 13 '13 at 19:05
  • @JakubLédl: What do you mean by "representation of the whole list"? Are you saying that you need to encode a whole list of these pairs with a single integer? If so, that wasn't clear from your question. And if so, then the answer is basically "that's not possible". – Oliver Charlesworth Jan 13 '13 at 19:09
  • 1
    Yes. I have no education in this area, so forgive me if this is somehow outrageous :) – Jakub Lédl Jan 13 '13 at 19:12
  • @JakubLédl: Ok, I completely rewrote my answer... ;) – Oliver Charlesworth Jan 13 '13 at 19:17
  • Thanks for the update, and the explanation, this was supposed to be an optimization anyway, so a different solution is already implemented... – Jakub Lédl Jan 13 '13 at 19:34