
I have an array of indices, some of which are repeats, and I want to loop through all unique values and identify which instance of a given index corresponds to the highest value in a second array of the same length. I have working code for this, but running through the hundreds of thousands (sometimes millions) of indices takes time, and I was wondering if there is a faster way. I suspect my use of np.append is the problem, but I haven't come across a better way.

import numpy as np

# Randomly draw 100,000 integers between 1 and 99,999 (randint's high is exclusive) as an example
index_list = np.random.randint(low=1, high=100000, size=100000)
# The data array being indexed
vals = np.linspace(1.,100000.,100000)

unique_index = np.unique(index_list)

biggest = []
for line in unique_index:
    # np.append copies the entire array on every call, so this loop is O(n**2)
    biggest = np.append(biggest, max(vals[np.where(index_list == line)]))

1 Answer


You can use a lambda expression to reduce the time of the operation. For your example, the lambda runs up to a second faster.

result = list(map(lambda x: max(vals[np.where(index_list == x)]), unique_index))

EDIT

Following ShadowRanger's suggestion of a list comprehension approach, I compared all three ways to see which is fastest.

import numpy as np
from datetime import datetime
# Randomly draw 100,000 integers between 1 and 99,999 (randint's high is exclusive) as an example
index_list = np.random.randint(low=1, high=100000, size=100000)
# The data array being indexed
vals = np.linspace(1.,100000.,100000)

unique_index = np.unique(index_list)

time1 = datetime.now()
biggest = []
for line in unique_index:
    biggest = np.append(biggest, max(vals[np.where(index_list == line)]))
time2 = datetime.now()
print('time diff with for loop =', time2 - time1)

time1 = datetime.now()
with_list_comprehension = [max(vals[np.where(index_list == x)]) for x in unique_index]
time2 = datetime.now()
print('time diff with list comprehension =', time2 - time1)

time1 = datetime.now()
lambda_way = list(map(lambda x: max(vals[np.where(index_list == x)]), unique_index))
time2 = datetime.now()
print('time diff with map lambda =', time2 - time1)

And the results are the following:

time diff with for loop = 0:00:05.194609
time diff with list comprehension = 0:00:04.133024
time diff with map lambda = 0:00:04.120000

  • An equivalent `list` comprehension will run faster. Any time you think "I need a `lambda` to use `map`/`filter`" the answer is "Use an equivalent list comprehension or generator expression"; they'll be more readable, and run faster to boot. In this case, `result = [max(vals[np.where(index_list == x)]) for x in unique_index]`. Feeding an equivalent generator expression to `numpy.fromiter` would save on memory, while avoiding the `O(n**2)` behavior of repeated calls to `numpy.append`. – ShadowRanger Oct 17 '19 at 11:17
  • Updated answer. – Nergon Oct 17 '19 at 11:46
  • Appreciate the input from you both. I kept looking, and there is a relevant resource [here](https://stackoverflow.com/questions/18452591/fast-python-numpy-where-functionality) which discusses the list comprehension suggestion made by ShadowRanger. Both of your suggestions make a huge improvement over my code; I ran the %timeit function on my real data (1019247 indices, of which 218981 are unique): my code: 3min 2s ± 46.4 s per loop; list comprehension: 20.6 s ± 1.58 s per loop; lambda: 5.97 s ± 124 ms per loop, so it looks like Billatron's answer is the best for my situation. Thanks again! – bxg682 Oct 17 '19 at 12:20
  • I made an error when timing lambda and fed it a much shorter array. With the true array, list comprehension is much faster - I haven't done an official timing (taking too long), but the lambda approach is much closer to my original code's timing. List comprehension seems to be the best approach. – bxg682 Oct 17 '19 at 12:31
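
ShadowRanger's numpy.fromiter suggestion from the comments above could look something like this minimal, untimed sketch (reusing the arrays defined in the question): the generator expression is consumed one value at a time, so no intermediate list is built and numpy.append is never called.

import numpy as np

index_list = np.random.randint(low=1, high=100000, size=100000)
vals = np.linspace(1., 100000., 100000)
unique_index = np.unique(index_list)

# np.fromiter fills a preallocated array straight from the generator;
# passing count lets numpy size the output array once, up front
biggest = np.fromiter(
    (vals[index_list == x].max() for x in unique_index),
    dtype=float,
    count=len(unique_index),
)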