Using Hashing to count the number of occurrences of a pattern in an integer array

Question

I have an integer array, and I define an interval to be a good interval iff every integer appears an even number of times (including zero) within it. I want to count the good intervals in a given integer array. For example, if array = [7, 7, 1, 5, 5, 1], the good intervals (1-indexed, inclusive) are [1, 2], [3, 6], [4, 5], [1, 6], corresponding to the contiguous subarrays [7, 7], [1, 5, 5, 1], [5, 5], [7, 7, 1, 5, 5, 1]. If array = [4, 5, 6, 5, 4], then there are no good intervals.

My naive solution uses two nested for loops to check every possible interval, but this takes O(n^2) time. I want a better solution that runs in O(n log n) time, and I feel that hashing may help, but I do not know how to incorporate it. I have been reading about the Rabin-Karp rolling hash algorithm for ideas, but I don't think it applies to what I need. Does anyone have an idea for an algorithm that uses hashing to solve this in O(n log n) time?
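
For reference, the naive two-loop approach described above might look like the following Python sketch. It assumes a running set of the values seen an odd number of times is kept for each start index; the name count_good_intervals_naive is hypothetical and is not from the original post.

```python
def count_good_intervals_naive(arr):
    """Count intervals in which every value occurs an even number of times."""
    total = 0
    for start in range(len(arr)):
        odd = set()  # values seen an odd number of times in arr[start:end + 1]
        for end in range(start, len(arr)):
            odd ^= {arr[end]}  # toggle membership, i.e. flip that value's parity
            if not odd:  # nothing has odd parity, so the interval is good
                total += 1
    return total
```

This is useful mainly as a baseline to check faster solutions against on small inputs.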

Paul Hankin · Accepted Answer · 2022-10-28T14:44:51.063

Suppose your array is called A.

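For each index i from 0 to n, you could compute the set of the elements in A[:i] that appear an odd number of times. Now your problem is equivalent to counting all pairs i < j such that these sets are equal, because A[i:j] is a good interval exactly when the sets for A[:i] and A[:j] coincide.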

Comparing the sets directly is still O(n^2) in the worst case, but instead of the sets themselves you can use a hash of each set. For efficiency, each hash needs to be incrementally computable from the hash of the previous set. One way to do this is to XOR together the values of a universal hash function applied to each element of the set. With this, you can add and remove single elements from the hash in O(1), and it has the benefit that adding and removing an element is exactly the same operation, making it very suitable for this problem, where the parity and not the exact count of the elements is important.

So compute this new array for indexes 0 to n inclusive:

B[0] = 0
B[i+1] = HASH(A[i]) XOR B[i]

Then count all 0<=i<j<=n such that B[i]=B[j] (which you can do in O(n) time, for example with a regular map).

This is a probabilistically correct algorithm: if you are unlucky, two different sets can have the same hash (for example, a non-empty set can have zero hash). If you use a universal b-bit hash, a lower bound for the probability it's correct is approximately exp(-n²/2^(b+1)), obtained from the birthday problem probability. So if you use a 128-bit hash, you're pretty safe for any input you're likely to find in practice.
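
To get a feel for the birthday-problem estimate, here is a small Python sketch that evaluates it for given n and b. It assumes the prefix hashes behave like independent random b-bit values, which is only an approximation; the function names are hypothetical.

```python
import math

def no_collision_probability(n, b):
    """Approximate chance that n random b-bit values are all distinct."""
    return math.exp(-n * n / 2 ** (b + 1))

def collision_probability(n, b):
    """Approximate chance that at least two of n random b-bit values collide."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x
    return -math.expm1(-n * n / 2 ** (b + 1))

if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, collision_probability(10 ** 6, bits))
```

Under this estimate, an array of a million elements almost certainly produces a collision with a 32-bit hash, has a collision chance of roughly 2.7e-8 with 64 bits, and of roughly 1.5e-27 with 128 bits.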

As an example, here's Python code that implements the naive version, which uses sets and runs in O(n^2) time in the worst case.

import collections

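def naive_evens(A):
    # B is the set of elements seen an odd number of times in the prefix so far.
    B = frozenset()
    # counts[S] is the number of prefixes seen so far whose odd-set is S.
    counts = collections.Counter()
    counts[B] += 1  # the empty prefix
    total = 0
    for a in A:
        B = B.symmetric_difference({a})  # toggle the parity of a
        total += counts[B]  # each earlier prefix with the same set ends a good interval here
        counts[B] += 1
    return total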

Here's the probabilistically correct version that uses hashing and runs in O(n) time. It uses HASH as a universal hash (with random seed HA), and parameters HW and HM which describe the word-size and number of bits of hash to create. To avoid hashing 0 to 0, the array elements are modified so that they're all positive (by adding something to each element so that the minimum element is always 1).

import collections
import random

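HW = 256  # word size in bits used for the multiplication
HM = 128  # number of hash bits to keep
# Random multiplier (seed). Forced odd here, since the multiply-shift scheme
# is usually stated with an odd multiplier.
HA = random.randrange(1 << HW) | 1

def HASH(x):
    # Multiply-shift hash: keep the top HM bits of (HA * x) mod 2^HW.
    h = (HA * x) % (1 << HW)
    return h >> (HW - HM)

def smart_evens(A):
    B = 0  # XOR of HASH over the elements seen an odd number of times so far
    counts = collections.Counter()
    counts[B] += 1  # the empty prefix
    total = 0
    M = min(A, default=0)  # offset so every hashed value is >= 1; default covers empty A
    for x in A:
        B = B ^ HASH(x - M + 1)
        total += counts[B]
        counts[B] += 1
    return total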

edited Oct 28 '22 at 14:44

answered Oct 28 '22 at 10:39

Paul Hankin


Hi Paul, appreciate the response. I am struggling to understand the first point you make. Suppose I have an array A = [7,7,1,5,5,1] and compute the sets of numbers appearing an odd number of times for A[:1] to A[:6]. This gives A[:1] = {7}, A[:2] = {}, A[:3] = {1}, A[:4] = {1,5}, A[:5] = {1} and A[:6] = {}. Counting the i and j with equal sets, I get A[:2] = A[:6] and A[:3] = A[:5]. This gives me a count of 4 if I also count each equality in the opposite direction. But if I let A = [2,2,2,2,2,2,2], I get a count of 18, which is way more than 12. Could you clarify what you mean here? – ISeekTheWisdom Oct 28 '22 at 13:37

If you take A=[2, 2, 2, 2, 2, 2, 2] you get sets [{}, {2}, {}, {2}, {}, {2}, {}, {2}]. There's 12 pairs of (i, j) where the sets are equal: (0, 2), (0, 4), (0, 6), (2, 4), (2, 6), (4, 6) for the pairs with {}, and (1, 3), (1, 5), (1, 7), (3, 5), (3, 7), (5, 7) for the pairs with {2}. – Paul Hankin Oct 28 '22 at 14:08

In your first calculation you miss A[:0]={}, and the pairs are (0, 2), (0, 6), (2, 6) and (3, 5). – Paul Hankin Oct 28 '22 at 14:10
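
The pair counts in the two comments above can be reproduced with a short Python sketch. It lists the odd-sets of every prefix A[:0] through A[:n] and then the index pairs whose sets are equal; the helper names prefix_odd_sets and equal_pairs are hypothetical and only meant for illustration.

```python
from itertools import combinations

def prefix_odd_sets(A):
    """Return the odd-sets of A[:0], A[:1], ..., A[:len(A)]."""
    sets = [frozenset()]  # A[:0] is empty, so nothing has odd parity
    for a in A:
        sets.append(sets[-1] ^ {a})  # toggle a in the previous prefix's set
    return sets

def equal_pairs(A):
    """Return all (i, j) with i < j whose prefix odd-sets are equal."""
    sets = prefix_odd_sets(A)
    return [(i, j) for i, j in combinations(range(len(sets)), 2)
            if sets[i] == sets[j]]
```

Each pair (i, j) found this way corresponds to the good interval A[i:j], which is why the empty prefix A[:0] has to be included.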

I added code for both the simple case and the optimized case, which maybe helps you understand (or find problems with the idea). – Paul Hankin Oct 28 '22 at 14:26
@PaulHankin It may be simpler to keep a counting hash using a bitvector or similar to keep track of the parities of the first k elements (these are the keys), then add up choose(count, 2) for every key. – Dave Oct 28 '22 at 23:27
Thank you Paul and Dave so much for the amazing ideas. Unfortunately I have to write this in Java, so there is a limit on the number of bits I can use for the hash. The solution I came up with ports Paul's code to Java, but for the hash function I decided to use the 64-bit hash method from [here](https://stackoverflow.com/a/12996028/20355806). – ISeekTheWisdom Oct 29 '22 at 03:38
@Dave a bitvector is a more efficient way to represent a set of small integers than a general set, but it's uncommon (and unrealistic) to assume that a bitvector of size n is O(1) in space, or that comparing two of them is O(1). In practice, your idea is quadratic in time and space in the worst case. – Paul Hankin Oct 29 '22 at 08:19
@ISeekTheWisdom even if you stick to 64-bit arithmetic, you can use multiple hash functions to get more bits. If you find a universal hash function that can give you 32 bits of hash, you can use 4 instances of it (each seeded differently) to get 128 bits of hash. – Paul Hankin Oct 29 '22 at 08:28
@PaulHankin That sounds like a good idea, but I simply cannot visualise what you mean by that. – ISeekTheWisdom Oct 30 '22 at 08:40
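
One way to picture the suggestion of combining several seeded hashes is the Python sketch below. It is only an illustration under stated assumptions: each 32-bit hash is a multiply-shift hash computed in 64-bit arithmetic with its own random odd multiplier, and four of them are concatenated into one 128-bit value. The names make_hash32, make_hash128 and count_good_intervals_128 are hypothetical. The masking with MASK64 imitates the wraparound that a Java long multiplication performs on its own, and the shift corresponds to Java's unsigned shift (>>>).

```python
import random

MASK64 = (1 << 64) - 1

def make_hash32(rng):
    """Return a 64-bit multiply-shift hash with 32 output bits and a random odd seed."""
    a = rng.getrandbits(64) | 1
    def h(x):
        return ((a * x) & MASK64) >> 32  # top 32 bits of the wrapped product
    return h

def make_hash128(rng):
    """Concatenate four independently seeded 32-bit hashes into 128 bits."""
    parts = [make_hash32(rng) for _ in range(4)]
    def h(x):
        out = 0
        for part in parts:
            out = (out << 32) | part(x)
        return out
    return h

def count_good_intervals_128(A, seed=None):
    h = make_hash128(random.Random(seed))
    lo = min(A, default=0)  # shift values so that none of them is hashed as 0
    acc, total = 0, 0
    counts = {0: 1}  # the empty prefix
    for x in A:
        acc ^= h(x - lo + 1)
        total += counts.get(acc, 0)
        counts[acc] = counts.get(acc, 0) + 1
    return total
```

In a language without big integers, the four 32-bit parts could instead be kept as two 64-bit values and used together as the map key; the result is still only correct with high probability.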

Dave · Answer 2 · 2022-10-29T15:34:48.617

Running time: O(n * r) where n is the size of the input array and r is the number of distinct elements. In the worst case r is close to n and this is O(n^2).

Assumptions: Your integers are contiguous or at least small. If you have integers that are much bigger than the number of distinct integers, you should use a hash to give each integer a unique id: 0, 1, 2, ..., and apply my algorithm to these ids.
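
The id assignment mentioned in the assumptions can be sketched in Python with a dictionary that hands out ids in order of first appearance. The name compress_to_ids is hypothetical and this is just one possible way to do it.

```python
def compress_to_ids(arr):
    """Map each distinct value to a small id 0, 1, 2, ... by first appearance."""
    ids = {}
    # setdefault returns the existing id, or stores and returns the next free one
    return [ids.setdefault(v, len(ids)) for v in arr]
```

For instance, compress_to_ids([10**9, -5, 10**9, 7]) gives [0, 1, 0, 2], so the bitvector needs only as many bits as there are distinct values.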

Initialize the following:

bitvec = 0. We'll use this to keep track of the parity of the number of times we've seen each element so far as we parse the input array.

bitvec_to_count: a hash that has a starting value of zero for each new key, and will keep track of the number of times we've seen each bitvec.

Set bitvec_to_count[0] = 1. We use this to say that we've seen the bitvec representing no elements once.

Now parse the array. For each element with value v, flip the v'th bit of the bitvector (because the parity of that value has changed), then increment the count of the resulting bitvector.

Finally, take the sum of choose(count, 2) over all the counts (the values) in the bitvec_to_count hash. This is your answer.

This takes advantage of the fact that good intervals are exactly those where the parities of the elements are identical just prior to the start and at the end (because every element shows up an even number of times within the interval itself, so the interval changes no parity).

Here's working Ruby code; should be simple to translate to Python.

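def f(arr)
  # Bit v of bitvec is the parity of the number of times value v has been seen.
  bitvec = 0
  # Maps each bitvec to the number of prefixes that produced it; missing keys count as 0.
  bitvec_to_count = Hash.new(0)
  bitvec_to_count[bitvec] += 1  # the empty prefix

  arr.each do |val|
    bitvec ^= 1 << val
    bitvec_to_count[bitvec] += 1
  end

  # Every pair of equal prefix bitvecs bounds one good interval: choose(count, 2) pairs.
  ans = 0
  bitvec_to_count.values.each do |count|
    ans += count * (count - 1) / 2
  end

  return ans
end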
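
A possible Python translation of the Ruby code above is sketched here. It assumes, as the answer does, that the values are small non-negative integers (a negative shift raises an error in Python), and it uses math.comb, which needs Python 3.8 or later. The name count_good_intervals_bitvec is hypothetical.

```python
import math
from collections import Counter

def count_good_intervals_bitvec(arr):
    """Count good intervals by grouping prefixes with equal parity bitvectors."""
    bitvec = 0  # bit v holds the parity of how often value v has been seen
    bitvec_to_count = Counter({0: 1})  # the empty prefix has all parities even
    for val in arr:
        bitvec ^= 1 << val  # flip the parity bit of this value
        bitvec_to_count[bitvec] += 1
    # each pair of prefixes with the same bitvec bounds exactly one good interval
    return sum(math.comb(c, 2) for c in bitvec_to_count.values())
```

As with the Ruby version, each bitvector can be as wide as the largest value, so the cost per step is not constant when the values are large.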

I don't think this is really linear, since you're depending on operations on n ints of O(n) bits being O(1). Can you run your code on an input list of size 1 million different ints? 10 million? — Paul Hankin, Oct 29 '22 at 05:12
