
I need to count the number of unique elements covered by a set of given ranges. My input is the start and end coordinates of these ranges, and I do the following:

>>>coordinates
 [[7960383, 7961255],
 [15688414, 15689284],
 [19247797, 19248148],
 [21786109, 21813057],
 [21822367, 21840682],
 [21815951, 21822369],
 [21776839, 21783355],
 [21779693, 21786111],
 [21813097, 21815959],
 [21776839, 21786111],
 [21813097, 21819613],
 [21813097, 21822369]]
>>>len(set(chain(*[range(i[0],i[1]+1) for i in coordinates])))   #here chain is from itertools

The problem is that this is not fast enough. It takes 3.5 ms (measured with %timeit) on my machine (buying a new computer is not an option), and since I need to do this for millions of sets, it is too slow.

Any suggestions on how this could be improved?

Edit: The number of rows can vary. In this case there are 12 rows. But I can't put any upper limit on it.
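For reference, the baseline above can be wrapped as a function for timing comparisons (the function name is mine; ends are treated as inclusive, per the edit):

```python
from itertools import chain

def count_unique_naive(coordinates):
    # materialise every integer in every (inclusive) range, deduplicate, count
    return len(set(chain.from_iterable(range(s, e + 1) for s, e in coordinates)))

# two non-overlapping ranges: (7961255-7960383+1) + (15689284-15688414+1) = 873 + 871
print(count_unique_naive([[7960383, 7961255], [15688414, 15689284]]))  # → 1744
```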

Manish Goel
    How many rows do you have in it? – Divakar Jul 20 '17 at 13:54
    you are commenting on Answers with timings. Help folks out and tell us what size you are testing with and what your processor speed is. BTW there is a super fast numba solution but it's not clear it's needed yet. – Phil Cooper Jul 20 '17 at 14:02
  • If by size you meant the size of coordinates, then it is the same as in the OP. And I need a method which is "comparatively" faster than my solution, so I guess by telling the time the proposed solution is taking on my machine I can let them know how fast their solution is. My processor speed is irrelevant, I assume!! – Manish Goel Jul 20 '17 at 14:07
  • This is a duplicate question. Why do you use pure Python code here, which is very slow for practically all mathematical operations? Take a look at these answers: https://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array – max9111 Jul 21 '17 at 12:42

3 Answers


You could just sum the difference between the coordinates of each range, subtracting any overlap with the previous range:

coordinates =[
    [ 7960383,  7961255],
    [15688414, 15689284],
    [19247797, 19248148],
    [21776839, 21786111],
    [21813097, 21819613],
    [21813097, 21822369]
]

# sort by increasing first coordinate, and if equal, by second:
coordinates.sort()

count = 0
prevEnd = 0
for start, end in coordinates:
    if end > prevEnd: # ignore a range that is sub-range of the previous one
        count += end - max(start, prevEnd)
        prevEnd = end

print(count)

This is both cheap in space and time.

Inclusive end coordinates

After your edit, it became clear you wanted the second coordinate to be inclusive. In that case "correct" the calculation like this:

count = 0
prevEnd = -1
for start, end in coordinates:
    if end > prevEnd: # ignore a range that is sub-range of the previous one
        count += end - max(start - 1, prevEnd)
        prevEnd = end
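Putting the inclusive-end version together as a self-contained function (the function name is mine) and running it on the twelve ranges from the question:

```python
def count_unique(coordinates):
    """Count distinct integers covered by ranges with inclusive end coordinates."""
    count = 0
    prevEnd = -1
    # sort by increasing first coordinate, and if equal, by second
    for start, end in sorted(coordinates):
        if end > prevEnd:  # skip ranges that are sub-ranges of an earlier one
            count += end - max(start - 1, prevEnd)
            prevEnd = end
    return count

coordinates = [
    [7960383, 7961255], [15688414, 15689284], [19247797, 19248148],
    [21786109, 21813057], [21822367, 21840682], [21815951, 21822369],
    [21776839, 21783355], [21779693, 21786111], [21813097, 21815959],
    [21776839, 21786111], [21813097, 21819613], [21813097, 21822369],
]
print(count_unique(coordinates))  # → 65901
```

Since the loop only keeps a running count and the previous end, this is O(n log n) for the sort and O(1) extra space.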
trincot
  • 4.2microsec. Great!! – Manish Goel Jul 20 '17 at 14:13
  • Small issue: when start > prevEnd, then need to add extra +1 to include the start position. – Manish Goel Jul 20 '17 at 14:38
    This should not be necessary when the end coordinates are assumed to not be inclusive (like also you have done in your code with the use of `range()`). If however, you expect the second coordinate to be inclusive, then you need indeed to take precautions, but that is rather "unpythonic". – trincot Jul 20 '17 at 14:46
  • Ahhh... thanks for mentioning. That is indeed my fault. I do need to include the end position and have edited the OP. And I guess your proposed solution can do that easily by using `start-1` instead of just `start` – Manish Goel Jul 20 '17 at 14:52
  • Indeed. Theoretically you would then also initialise `prevEnd = -1`. – trincot Jul 20 '17 at 15:00
  • Nice, I think you can use `sorted(coordinates) ` and the itemgetter thing isn't needed. – Phil Cooper Jul 20 '17 at 16:08

With NumPy you can do:

import numpy as np

coordinates = ...
nums = np.concatenate([np.arange(start, end) for start, end in coordinates], axis=0)  # use end + 1 if end coordinates are inclusive
num_unique = len(np.unique(nums))

Update

If you can afford a matrix with as many rows as the number of coordinate pairs and as many columns as the largest coordinate value, another option would be:

import numpy as np

coordinates = np.asarray(coordinates)
nums = np.tile(np.arange(np.max(coordinates)), (len(coordinates), 1))
m = (nums >= coordinates[:, :1]) & (nums < coordinates[:, 1:])
num_unique = np.count_nonzero(np.logical_or.reduce(m, axis=0))
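A quick sanity check of the mask approach on a tiny input (note that, as written, it treats end coordinates as exclusive):

```python
import numpy as np

coords = np.asarray([[0, 3], [2, 5]])  # half-open ranges [0, 3) and [2, 5)
nums = np.tile(np.arange(np.max(coords)), (len(coords), 1))
m = (nums >= coords[:, :1]) & (nums < coords[:, 1:])
print(np.count_nonzero(np.logical_or.reduce(m, axis=0)))  # → 5 (elements 0..4)
```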
jdehesa
  • It is still taking 2ms. Can it be improved more? – Manish Goel Jul 20 '17 at 13:58
  • @ManishGoel I was about to add the approach of [Divakar](https://stackoverflow.com/users/3293881/divakar)'s answer as an alternative, you can check that one. – jdehesa Jul 20 '17 at 14:00
    @ManishGoel I've added another alternative for the sake of study, but it's not likely to be faster than [trincot](https://stackoverflow.com/users/5459839/trincot)'s solution (and it's quite memory expensive). – jdehesa Jul 20 '17 at 14:16
  • Tried it. Took 1.2sec. Any idea why it would be slow?? maybe took extra time to initiate huge matrix? – Manish Goel Jul 20 '17 at 14:22
  • @ManishGoel Yeah that's most likely. In general (in my experience), NumPy (or array libraries in general) transition quite quickly from incredibly fast to terribly slow when you start to step out of reasonable array sizes. – jdehesa Jul 20 '17 at 15:27

Maybe this is better?

from functools import reduce  # needed on Python 3

len(reduce(lambda x, y: set(x).union(set(y)), array))
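As written, this unions the coordinate pairs themselves rather than the elements they cover. For it to count elements, `array` would need to hold the expanded ranges; a self-contained sketch (assuming inclusive ends, variable names mine):

```python
from functools import reduce  # reduce lives in functools on Python 3

coordinates = [[0, 4], [3, 6]]
array = [range(s, e + 1) for s, e in coordinates]  # inclusive ends, as in the question
print(len(reduce(lambda x, y: set(x).union(y), array)))  # → 7 (elements 0..6)
```

This still materialises every element of every range, so it shares the performance problem of the original approach.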
yoav_aaa