UPDATE 2:
The numpy solution of @piRSquared is the winner for what I would consider the most common cases. Here is his answers with a minimum modification to assign the columns to the original dataframe (which I did in all my tests, in order to be consistent with the example of the original question)
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
min_column = options[np.nanargmin(v, axis=1)])
You should be careful if you have a lot of columns (more than 10000), since in these extreme cases results could start changing significatively.
UPDATE 1:
According to my tests calling min
and idxmin
separatedly is the fastest you can do based on all the proposed answers.
Although it is not at the same time(see direct answer below), you should be better of using DataFrame.lookup
on the column indexes (min_column
colum), in order to avoid the search for values (min_values
).
So, instead of traversing the entire matrix - which is O(n*m), you would only traverse the resulting min_column
series - which is O(n):
df = pd.DataFrame({
'id': [0,1,2,3,4],
'option_1': [10, np.nan, np.nan, 400, 600],
'option_2': [np.nan, 20, 300, np.nan, 700],
'option_3': [np.nan, 200, 30, np.nan, 50],
'option_4': [110, np.nan, np.nan, 40, 50],
})
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
Direct answer (not as efficient)
Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:
def min_idxmin(x):
_idx = x.idxmin()
return _idx, x[_idx]
df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
lambda x: min_idxmin(x), axis=1))
To be clear, although here the 2nd search is removed (replaced by a direct acccess in x[_idx]
), this will highly likely take much longer because you are not exploitng the vectorizing properties of pandas/numpy.
Bottom line is pandas/numpy vectorized operations are very fast.
Summary of the summary:
There doesn't seem to be any advantage in using df.lookup
, calling min
and idxmin
separatedly is better, than using the lookup which is mind blowing and deserves a question in itself.
Summary of the timings:
I tested a dataframe with 10000 rows and 10 columns (option_
sequence in the initial example). Since, I got a couple of unexpected result, I then also tested with 1000x1000, and 100x10000. According to the results:
Using numpy as @piRSquared (test8) suggested is the clear winner, only start perfoming worse when there is a lot of columns (100, 10000, but does not justify the general use of it). The test9 modifies trying to using index in numpy, but it generally speaking performs worse.
Calling min
and idxmin
separatedly was the best for the 10000x10 case, even better than the Dataframe.lookup
(although, the Dataframe.lookup
result performed better in the 100x10000 case). Although the shape of the data influence the results, I would argue that having 10000 columns is a bit unrealistic.
The solution provided by @Wen followed in performance, though it was not better than calling idxmin
and min
separatedly, or using Dataframe.lookup
. I did an extra test (see test7()
) because I felt that the the addition of operation (reset_index
and zip
might be disturbing the result. It was still worse than test1
and test2
, even though it does not do the assigment (I couldn't figure out how to make the assigment using the head(1)
). @Wen, would you mind giving me a hand?
@Wen solution underperfoms when there are more columns (1000x1000, or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.
Any other solution with lambda expression, or that uses the transpose (T
) falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 secs using the transpose T suggested by @piRSquared and @RafaelC.
TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):
Using the following dataframe of 10000 rows and 10 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
Calling the two columns twice separatedly:
def test1():
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)
%timeit -n 100 test1()
13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Calling the lookup (it is slower for this case!):
def test2():
df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
%timeit -n 100 test2()
# 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using apply
and min_idxmin(x)
:
def min_idxmin(x):
_idx = x.idxmin()
return _idx, x[_idx]
def test3():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
lambda x: min_idxmin(x), axis=1))
%timeit -n 10 test3()
# 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using agg['min', 'idxmin']
by @piRSquared:
def test4():
df['min_column'], df['min_value'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
%timeit -n 1 test4()
# 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using agg['min', 'idxmin']
by @RafaelC:
def test5():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)
%timeit -n 1 test5()
# 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sorting values by @Wen:
def test6():
df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
%timeit -n 100 test6()
# 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sorting values by @Wen modified by me to make the comparison fairer due to overload of assigment operation (I explained why in the summary at the beginning):
def test7():
df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
%timeit -n 100 test7()
# 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy:
def test8():
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
min_column = options[np.nanargmin(v, axis=1)])
%timeit -n 100 test8()
# 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy but avoid the search (indexing instead):
def test9():
col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
idxmin = np.nanargmin(v, axis=1)
# instead of looking for the answer, indexes are used
df.assign(min_value = v[range(v.shape[0]), idxmin],
min_column = options[idxmin])
%timeit -n 100 test9()
# 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
TimeIt results with 1000 rows x 1000 columns:
I perform more test with a 1000x1000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()
Although the results change:
test1 ~27.6ms
test2 ~29.4ms
test3 ~135ms
test4 ~1.18s
test5 ~1.29s
test6 ~287ms
test7 ~290ms
test8 ~25.7
test9 ~26.1
TimeIt results with 100 rows x 10000 columns:
I perform more test with a 100x10000 shape:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()
Although the results change:
test1 ~46.8ms
test2 ~25.6ms
test3 ~101ms
test4 ~289ms
test5 ~276ms
test6 ~349ms
test7 ~301ms
test8 ~121ms
test9 ~122ms