
I wanted to save some time by using lookup after idxmin, instead of calling both min and idxmin. In my head, the first approach should be more efficient, because the second scans the values twice (once for the minimum value and once for the index of the minimum value, i.e. 2 x O(NxM)), whereas the first scans once for the indexes (O(NxM)) and then uses those indexes to locate the values (O(N)).

Please check this question for the context and more details on my reasoning.

The results started to look unexpected, so I proceeded with some tests:

I used a DataFrame of 100000 rows x 10 columns (results get worse by adding more rows):

import pandas as pd
import numpy as np

# 100000 rows x 10 integer columns, plus the index reset into a regular column
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 10)),
                  columns=[f'option_{x}' for x in range(1, 11)]).reset_index()
# label of the column holding each row's minimum
df['min_column'] = df.filter(like='option').idxmin(1)
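
For reference, the two strategies I am comparing look roughly like this (my restatement of the linked question, not its exact code):

opts = df.filter(like='option')

# strategy 1: two full scans, one for the values and one for the labels
min_vals = opts.min(1)
min_cols = opts.idxmin(1)

# strategy 2: one full scan for the labels, then a single lookup for the values
min_cols = opts.idxmin(1)
min_vals = pd.Series(df.lookup(df.index, min_cols), index=df.index)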

And then I did some timings:

%timeit -n 100 df.filter(like='option').min(1)
# 12.2 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n 100 df.lookup(df.index, df['min_column'])
# 46.9 ms ± 526 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Notice that even though min_column was precalculated for the lookup, the lookup is about 4 times slower than simply searching for the minimum.

Comparison for other sizes:

Rows x Cols    min        lookup
100000x10      12.2 ms    46.9 ms
1000000x10     162 ms     682 ms
10000x1000     173 ms     220 ms
1000x10000     295 ms     7.97 ms

From the above table, as expected, the results don't get any better by adding rows (1000000x10); there is only a small catch-up when adding many more columns (10000x1000). The catch-up makes sense, but in my head it should be way bigger: an index lookup should be faster than a search (see the updated numpy results below), yet I only start seeing an advantage in extreme, almost impractical cases (e.g. 1000x10000).

Is there any explanation for this behaviour?

UPDATE:

I tested this with numpy, and I get the expected behaviour:

vals = np.random.randint(0,10,size=(100000, 10))
%timeit -n 100 np.min(vals, axis=1)
2.83 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

idx_min = np.argmin(vals, axis=1)
%timeit -n 100 vals[np.arange(len(idx_min)), idx_min]
1.63 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Comparison results (numpy):

Rows x Cols    min        indexing using []
100000x10      2.83 ms    1.63 ms
1000000x10     24.6 ms    15.4 ms
100000x100     14.5 ms    3.38 ms
10000x1000     11.1 ms    0.377 ms
toto_tico
  • It seems that `min` and `idxmin` do exactly what you need here, so your question amounts to asking why there is no faster way to do exactly what `min` and `idxmin` were designed to do. In that case, I would only expect an alternative to be better if `min` and `idxmin` themselves were poorly designed. – JohnE Apr 25 '19 at 04:44
  • In terms of the pandas vs numpy issue, I don't know if that is apples to apples, as you're doing a label-based lookup in pandas but a location-based lookup in numpy, and I'd think the latter would generally be much faster. – JohnE Apr 25 '19 at 04:45

1 Answer


If you look at the source code of the lookup function, it does not look very efficient. The source code can be found here:

http://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L3435-L3484

In particular, the main if/else branch does the following:

thresh = 1000
if not self._is_mixed_type or n > thresh:
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise KeyError('One or more row labels was not found')
    if (cidx == -1).any():
        raise KeyError('One or more column labels was not found')
    flat_index = ridx * len(self.columns) + cidx
    result = values.flat[flat_index]
else:
    result = np.empty(n, dtype='O')
    for i, (r, c) in enumerate(zip(row_labels, col_labels)):
        result[i] = self._get_value(r, c)

I am not sure about the detailed implementation of the if branch, but you might want to try this with a very large number of rows and a very large number of columns; you might get better results out of the lookup function there.
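
To see where the time goes in the fast path, you can replicate it by hand (a sketch, assuming the 100000x10 df from the question, whose string min_column makes the frame mixed-type):

values = df.values                               # mixed dtypes force a full object-array conversion
ridx = df.index.get_indexer(df.index)            # resolve N row labels to positions
cidx = df.columns.get_indexer(df['min_column'])  # resolve N column labels to positions
flat_index = ridx * len(df.columns) + cidx       # flatten the (row, col) pairs
result = values.flat[flat_index]                 # one fancy-indexing pass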

You should probably define your own lookup so that you know its exact runtime, rather than relying on this lookup function.
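
For example, something like this (a minimal sketch, assuming the df and min_column from the question: resolve the column labels to positions once, then index the homogeneous numeric block with plain numpy):

import numpy as np

opts = df.filter(like='option')
vals = opts.values                                    # homogeneous int block, no object conversion
col_pos = opts.columns.get_indexer(df['min_column'])  # label -> position, done once
row_mins = vals[np.arange(len(vals)), col_pos]        # O(N) positional indexing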

Mehdi Golari