3

For any two sequences a, b, where a = [a1,a2,...,an] and b = [b1,b2,...,bn] (0<=ai, bi<=m), I want to find an integer function f that f(a) = f(b) if and only if a, b have the same elements, without concerning about their orders.For example, if a = [1,1,2,3], b = [2,1,3,1], c = [3,2,1,3], then f(a) = f(b), f(a) ≠ f(b).

I know there is a naive algorithm which first sort the sequence and then map it to an integer. For example, after sorting, we have a = [1,1,2,3], b = [1,1,2,3], c = [1,2,3,3], and suppose that m = 9, using decimal conversion, we will finally have f(a) = f(b) = 1123 ≠ f(c) = 1233. But this will take O(nlog(n)) time using some kind of sorting algorithm (Don't use non comparison sorting algorithms).

Is there a better approach ? Something like hash? An O(n) algorithm?

Note that I also need the function easy to be inversed, which means that we can map an integer back to a sequence (or a set, more concisely).

Update: Forgive my poor description. Here both m, n can be very large (1 million or larger). And I also want the upper bound of f to be quite small, preferably O(m^n).

akirast
  • 1,137
  • 1
  • 8
  • 10
  • 1
    possible duplicate of [Check if array B is a permutation of A](http://stackoverflow.com/questions/10639661/check-if-array-b-is-a-permutation-of-a) – Bernhard Barker Nov 06 '13 at 13:36
  • Or [Check if 2 arrays are similar without hashing or sorting](http://stackoverflow.com/questions/9604801/check-if-2-arrays-are-similar-without-hashing-or-sorting). – Bernhard Barker Nov 06 '13 at 13:37
  • Hashing and linked lists might help. But it's going to be impossible to make this function reversible unless your sets are bounded in some way. For example, your naïve algorithm fails because `f([1,1,2,3]) ≡ f([11,23])`. – r3mainer Nov 06 '13 at 13:49
  • @wildplasser m can be very large. – akirast Nov 07 '13 at 01:40
  • @Dukeling I don't think the questions are duplicate since I need to convert the array to an integer. – akirast Nov 07 '13 at 01:42

2 Answers2

4

This works for sufficiently small values of m, and sufficiently small array sizes:

#include <stdio.h>

unsigned primes [] = { 2,3,5,7,11,13,17, 19, 23, 29};
unsigned value(unsigned array[], unsigned count);

int main(void)
{
unsigned one[] = { 1,2,2,3,5};
unsigned two[] = { 2,3,1,5,2};
unsigned val1, val2;

val1 = value(one, 5);
val2 = value(two, 5);
fprintf(stdout, "Val1=%u, Val2=%u\n", val1, val2 );

return 0;
}

unsigned value(unsigned array[], unsigned count)
{
unsigned val, idx;

val = 1;
for (idx = 0; idx < count; idx++) {
        val *= primes [ array[idx]];
        }

return val;
}

For an explanation, see my description here.

Community
  • 1
  • 1
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • 1
    It's not *too* difficult to see what's going on there, but possibly only because I've seen this idea a few times before, and I know C. You should consider adding some pseudo-code or a high-level description. – Bernhard Barker Nov 06 '13 at 15:03
  • Why should I add pseudo code? The most complex construct in the whole thing is **one for loop**, which is present an about any programming language. And so are function calls. Compare it to the python rubble below and pick the one you find easiest to read. – wildplasser Nov 06 '13 at 16:09
  • @wildplasser: The for loop isn't difficult. What's difficult (for someone who hasn't seen this before) is *what the hell are prime numbers suddenly doing here?* – j_random_hacker Nov 06 '13 at 16:13
  • Given the mathematical way in which the OP sketches the problem, he should be able to find it out. And the example is small enough to do the calculations by hand. (or add a few printf()s at strategical places) No side wheels required. – wildplasser Nov 06 '13 at 16:20
  • 1
    +1 for the elegant solution. I just want to support the others who suggested adding a comment within the code would go a long way. E.g. `/* Map each array element to a prime number and take the product. */` – Matt Nov 07 '13 at 01:22
  • Impressive answer, and would you please read the update of the question? – akirast Nov 07 '13 at 01:53
  • For larger problems it will not work. period. @Matt: I added a link to the anagram answer, which has an explanation. – wildplasser Nov 07 '13 at 08:39
3

Wow, @wildplasser answer is actually super smart. To expand a bit:

Any number can be decomposed in a unique fashion in prime numbers (this is known as the fundamental theorem of arithmetic). His answer relies on that, by building a number for which the input array is a representation of prime factor decomposition. As multiplication is commutative, the exact order of the elements in the array is of no importance, but a given number is associated to one (and only one) sequence of elements.

His solution can be expanded for arbitrary size, e.g. in Python:

import operator
import itertools
import math

class primes(object):
    def __init__(self):
        self.primes = [2,3,5,7,11]
        self.stream = itertools.count(13, 2)

    def __getitem__(self, i):
        sq = int(math.sqrt(i))
        while i >= len(self.primes):
            n = self.stream.next()
            while any(n % p == 0 for p in self.primes if p <= sq):
                n = self.stream.next()
            self.primes.append(n)
        return self.primes[i]

def prod(itr):
    return reduce(operator.mul, itr, 1)

p = primes()

def hash(array):
    return prod(p[i] for i in array)

With the expected results:

>>> hash([1,2,2,3,5])
6825
>>> hash([5,3,2,2,1])
6825

Here, 6825 = 3^1 x 5^2 x 7^1 x 13^1, as 3 is the '1' prime (0-indexed), 5 the '2', etc...

>>> 3**1 * 5**2 * 7**1 * 13**1
6825

Building the number itself is O(n) multiplications, as long as the final result remains in the domain of int that you are using (unfortunately, I suspect it might get out of hand quite quickly). Building the series of prime number with an Eratosthenes Sieve as I did is asymptotically O(N * log log N), where N is the m-th largest prime. As asymptotically, N ~ m log m, this gives an overall complexity of O(n + m * log m * loglog (m * log m))

Using a similar approach, instead of taking the prime number decomposition, we can also consider the array to be a representation of the decomposition of a number in a base. To be consistent, this base has to be larger than the larger number of similar elements (e.g. for [5, 3, 3, 2, 1], base must be > 2 because there are two 3). Being on the safe side, you can write:

def hash2(array):
    n = len(array)
    return sum(n**i for i in array)

>>> hash2([1,5,3,2,2])
8070
>>> hash2([2,1,5,2,3])
8070

You can improve this by computing the largest number of similar elements in the array first, but the hash2 function will be a real hash only when used with the same basis, so the prime number decomposition is probably safe if you work with arrays of varying length and composition, as it will ALWAYS return the same unique integer per bag of numbers.

val
  • 8,459
  • 30
  • 34
  • 1
    I believe that the Eratosthenes sieve algorithm is `O(n log log n)` to find the primes *up to* `n`. Since the `k`-th prime number is asymptotically `k log k`, the complexity of finding the *first n* primes should be something like `O(n log n log log n)`. – rici Nov 06 '13 at 22:06
  • @rici: you are absolutely correct, I got confused here. That is corrected. – val Nov 07 '13 at 09:09