1

Input is 200 K doubles pairs i.e. keypair. they have no range i.e. keys can be any numbers.

Now, the task is to group such big data into the following:

key1 => (value1, value2, value3)
key2 => (value1, value2, value3)
...
...

a EPSILON (e.g. 1e-8) can be considered when comparing equality of key (doubles).

I have developed a O(n^2) solution but it is too slow for 200K doubles. wondering any improvements? such as O(nlogn) would be nice.

Another example,

Input:  <0.1, 0.3>,  <0.1, 0.4>,  <0.2, 0.5>, <0.2, 0.6>, <0.3, 0.7>
Output

0.1 => (0.3, 0.4)
0.2 => (0.5, 0.6)
0.3 => (0.7)
justyy
  • 5,831
  • 4
  • 40
  • 73
  • 1
    How would you handle the case that you have N values that are all within EPSILON of some other value in the consideration set, e.g. (1.0, 1.0 + EPSILON, 1.0 + 2xEPSILON, 1.0 + 3xEPSILON)? – Eric J. May 20 '14 at 19:48
  • Consider the keys are the same. – justyy May 20 '14 at 19:50
  • 1
    So you could have 1,000,000 numbers following that pattern, and they would all have the same key? But if you had (1.0, 1.0 + 2xEPSILON, 1.0 + 4xEPSILON, ...) they would have different keys? Does not make sense to me. – Eric J. May 20 '14 at 19:51
  • doesnt matter , the input wont have much close keys – justyy May 20 '14 at 19:52

2 Answers2

1

Why not sort? Sort according to first value and you are (almost) done. It is O(nlogn).

ElKamina
  • 7,747
  • 28
  • 43
1

In order to avoid the issue of the grouping key depending on other grouping keys - the issue of treating keys like

(1.0, 1.0 + EPSILON, 1.0 + 2xEPSILON, 1.0 + 3xEPSILON)

consistently with keys like

(1.0, 1.0 + 2xEPSILON, 1.0 + 4xEPSILON, ...)

the most logical choice seems to be to use a HashSet and create a hash key by quantizing the actual double value of the key into buckets EPSILON in size.

Depending on your requirements for EPSILON, you might adopt the discussion below to quantize your expected input range into something like a long:

Convert/Quantize Float Range to Integer Range

Community
  • 1
  • 1
Eric J.
  • 147,927
  • 63
  • 340
  • 553
  • thanks.. I have changed my algorithm to O(n) using hash table, works like a charm... lighting fast! – justyy May 21 '14 at 10:36