
I was wondering if it is possible to call idxmin and min at the same time (i.e., in the same call/loop).

Assuming the following dataframe:

    id  option_1    option_2    option_3    option_4
0   0   10.0        NaN         NaN         110.0
1   1   NaN         20.0        200.0       NaN
2   2   NaN         300.0       30.0        NaN
3   3   400.0       NaN         NaN         40.0
4   4   600.0       700.0       50.0        50.0

I would like to calculate the minimum value (min) and the column that contains it (idxmin) of the option_ series:

    id  option_1    option_2    option_3    option_4    min_column  min_value
0   0   10.0        NaN         NaN         110.0       option_1        10.0
1   1   NaN         20.0        200.0       NaN         option_2        20.0
2   2   NaN         300.0       30.0        NaN         option_3        30.0
3   3   400.0       NaN         NaN         40.0        option_4        40.0
4   4   600.0       700.0       50.0        50.0        option_3        50.0

Obviously, I can call idxmin and min separately (one after the other; see the example below), but is there a way of making this more efficient without searching the matrix twice (once for the value and once for the index)?


An example calling min and idxmin

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [0,1,2,3,4], 
    'option_1': [10,     np.nan, np.nan, 400,    600], 
    'option_2': [np.nan, 20,     300,    np.nan, 700], 
    'option_3': [np.nan, 200,    30,     np.nan, 50],
    'option_4': [110,    np.nan, np.nan, 40,     50], 
})

df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.filter(like='option').min(1)

(I expected this would be suboptimal as the search is performed twice.)

  • The intuitive assumption *"Clearly this is suboptimal as the search is performed twice."* is wrong due to vectorization, per [your benchmarks](https://stackoverflow.com/questions/51932428/obtain-min-and-idxmin-or-max-and-idxmax-at-the-same-time-simultaneou/51932429#51932429). It turns out two separate vectorized calls are **fastest**. Given that, please edit the premise of the question. Are you just curious whether it can be done, or willing to take a huge performance hit for a more compact syntax that eliminates one line of code, or what? – smci Aug 21 '18 at 03:29
  • I don't fully understand your comment; the bolded question states **is there a way of making this more efficient without searching the matrix twice?** (I am looking to improve performance). I found that the lookup is only better in extreme cases, because a `lookup` is slower than a `min` search; after realizing this, I asked another [question here](https://stackoverflow.com/questions/51934952/why-is-df-lookup-slower-than-df-min) – toto_tico Aug 21 '18 at 07:33
  • Because your premise *"searching the df twice"* ⇏ *"less efficient"*. So either you could ask *"How to make this more efficient?"* or *"How to avoid searching the df twice?"* But not both. (By the way it's a (pandas) dataframe not a (numpy) matrix. If you did this on a numpy matrix the most efficient methods would likely be different again). – smci Aug 22 '18 at 23:08
  • @smci, the premise of the question is based on theory (not an "intuitive assumption"): directly accessing values through indexes (*direct accesses*) in an array is `O(1)` while searching is `O(n)`. In practice, the constant factor (`k`) of direct accesses seems to be very big, and vectorized searches are very fast. I *assumed* that direct accesses would also have been vectorized (am I wrong?), in which case we are comparing `searches` vs `direct access`, both vectorized - again, my assumption here could be wrong. – toto_tico Aug 23 '18 at 08:07
  • @smci, I used the term `matrix` in the mathematical sense, only to refer to the values of the pandas dataframe that I am interested in. I don't know exactly how the values are represented internally in pandas, but I know that numpy is operating in the background. I don't mind whether the solution is in `pandas` or `numpy`. In any case, in `test9` I also tried to use direct accesses in numpy (instead of two searches); still, searching twice was more efficient. – toto_tico Aug 23 '18 at 08:12
  • I'd use the term *"the values"* or *"the values in those rows and columns"*. pandas `.lookup()` is slower than `[]`/`.iloc()` because it takes labels (not indices). As your answer showed, the more columns, the slower it gets. Also it helps if you state what range of typical or max dimensions you care about: 10,000 columns or more? 1,000 rows or more? Also is your matrix sparse, do you care about the NaN entries, can the NaNs be sparsely represented? See [pandas Sparse data structures](https://pandas.pydata.org/pandas-docs/stable/sparse.html) – smci Aug 23 '18 at 21:48
  • *"the premise of the question is based on theory that directly accessing values through indexes (direct accesses) in an array is O(1) and searching is O(n)"* Yes, that's just repeating that **lookups (using labels) are slower than direct indexing (with indices)**. As distinct to whether you make one or two accesses. You could directly index with `[]`/`.iloc()`, I'd avoid `.lookup()` and `.loc()` for speed. Many existing pandas Q&A say that. – smci Aug 23 '18 at 22:04
  • Related: [pandas iloc vs ix vs loc explanation, how are they different?](https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation-how-are-they-different), [pandas loc vs. iloc vs. ix vs. at vs. iat?](https://stackoverflow.com/questions/28757389/pandas-loc-vs-iloc-vs-ix-vs-at-vs-iat), [Is .ix() always better than .loc() and .iloc() since it is faster and supports integer and label access?](https://stackoverflow.com/questions/27667759/is-ix-always-better-than-loc-and-iloc-since-it-is-faster-and-supports-i) – smci Aug 23 '18 at 22:06
  • ... [Comparison of Pandas lookup times](https://stackoverflow.com/questions/38254067/comparison-of-pandas-lookup-times) – smci Aug 23 '18 at 22:14
  • @smci, I understand the difference between `loc`, `iloc`, `ix` (deprecated), `at` and `iat`. None of them applies to my case, in which I have a column with labels: in each row, I need to access a different column. As for the use of labels in lookups, point taken - I will comment on that. – toto_tico Aug 24 '18 at 08:06
  • Do you happen to know how the labels are transformed into direct indexes? I would have thought it was dictionary-like (a hash table), in which case more columns shouldn't make it slower. Wait, actually, what you said is not what my results show; it is not possible to conclude anything about the lookup and the number of columns from my current results. (I tested it now: the results are pretty much the same for 1000 rows and 100, 1000, or 10000 columns, but for 10 columns they were ~5x faster.) – toto_tico Aug 24 '18 at 08:23
  • @smci, would you mind continuing the conversation [here](https://stackoverflow.com/questions/51934952/why-is-df-lookup-slower-than-df-min)? I feel like the latter comments don't belong here, but to that other question. – toto_tico Aug 24 '18 at 08:33

3 Answers


transpose then agg

df.set_index('id').T.agg(['min', 'idxmin']).T

  min    idxmin
0  10  option_1
1  20  option_2
2  30  option_3
3  40  option_4
4  50  option_3
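
To attach this back to df in the shape the question asks for, one option is a join on id (a sketch; the column renames are mine, not part of the original answer):

res = df.set_index('id').T.agg(['min', 'idxmin']).T
res.columns = ['min_value', 'min_column']  # rename to match the question
df.join(res, on='id')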

Numpy v1

d_ = df.set_index('id')
v = d_.values
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    Idxmin=d_.columns[np.nanargmin(v, axis=1)]
), d_.index)

      Idxmin   Min
id                
0   option_1  10.0
1   option_2  20.0
2   option_3  30.0
3   option_4  40.0
4   option_3  50.0

Numpy v2

col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
pd.DataFrame(dict(
    Min=np.nanmin(v, axis=1),
    IdxMin=options[np.nanargmin(v, axis=1)]
))
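
For the example frame this yields the same values as v1 (the Min/IdxMin column order assumes Python 3.6+ dict insertion order):

    Min    IdxMin
0  10.0  option_1
1  20.0  option_2
2  30.0  option_3
3  40.0  option_4
4  50.0  option_3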

Full Simulation

Conclusion

The Numpy solutions are fastest.

Results

10 columns

         pir_agg_1  pir_agg_2  pir_agg_3  wen_agg_1  tot_agg_1  tot_agg_2
10       12.465358   1.272584        1.0   5.978435   2.168994   2.164858
30       26.538924   1.305721        1.0   5.331755   2.121342   2.193279
100      80.304708   1.277684        1.0   7.221127   2.215901   2.365835
300     230.009000   1.338177        1.0   5.869560   2.505447   2.576457
1000    661.432965   1.249847        1.0   8.931438   2.940030   3.002684
3000   1757.339186   1.349861        1.0  12.541915   4.656864   4.961188
10000  3342.701758   1.724972        1.0  15.287138   6.589233   6.782102

[plot: relative timings vs number of rows, 10 columns]

100 columns

        pir_agg_1  pir_agg_2  pir_agg_3  wen_agg_1  tot_agg_1  tot_agg_2
10       8.008895   1.000000   1.977989   5.612195   1.727308   1.769866
30      18.798077   1.000000   1.855291   4.350982   1.618649   1.699162
100     56.725786   1.000000   1.877474   6.749006   1.780816   1.850991
300    132.306699   1.000000   1.535976   7.779359   1.707254   1.721859
1000   253.771648   1.000000   1.232238  12.224478   1.855549   1.639081
3000   346.999495   2.246106   1.000000  21.114310   1.893144   1.626650
10000  431.135940   2.095874   1.000000  32.588886   2.203617   1.793076

[plot: relative timings vs number of rows, 100 columns]

Functions

def pir_agg_1(df):
  return df.set_index('id').T.agg(['min', 'idxmin']).T

def pir_agg_2(df):
  d_ = df.set_index('id')
  v = d_.values
  return pd.DataFrame(dict(
      Min=np.nanmin(v, axis=1),
      IdxMin=d_.columns[np.nanargmin(v, axis=1)]
  ))

def pir_agg_3(df):
  col_mask = df.columns.str.startswith('option')
  options = df.columns[col_mask]
  v = np.column_stack([*map(df.get, options)])
  return pd.DataFrame(dict(
      Min=np.nanmin(v, axis=1),
      IdxMin=options[np.nanargmin(v, axis=1)]
  ))

def wen_agg_1(df):
  v = df.filter(like='option')
  d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
  d.columns = ['IdxMin', 'Min']
  return d

def tot_agg_1(df):
  """I combined toto_tico's 2 filter calls into one"""
  d = df.filter(like='option')
  return df.assign(
      IdxMin=d.idxmin(1),
      Min=d.min(1)
  )

def tot_agg_2(df):
  d = df.filter(like='option')
  idxmin = d.idxmin(1)
  return df.assign(
      IdxMin=idxmin,
      Min=d.lookup(d.index, idxmin)
  )

Sim setup

from timeit import timeit

def sim_df(n, m):
  return pd.DataFrame(
      np.random.randint(m, size=(n, m))
  ).rename_axis('id').add_prefix('option').reset_index()


fs = 'pir_agg_1 pir_agg_2 pir_agg_3 wen_agg_1 tot_agg_1 tot_agg_2'.split()
ix = [10, 30, 100, 300, 1000, 3000, 10000]

res_small_col = pd.DataFrame(index=ix, columns=fs, dtype=float)
res_large_col = pd.DataFrame(index=ix, columns=fs, dtype=float)

for i in ix:
  df = sim_df(i, 10)
  for j in fs:
    stmt = f"{j}(df)"
    setp = f"from __main__ import {j}, df"
    res_small_col.at[i, j] = timeit(stmt, setp, number=10)

for i in ix:
  df = sim_df(i, 100)
  for j in fs:
    stmt = f"{j}(df)"
    setp = f"from __main__ import {j}, df"
    res_large_col.at[i, j] = timeit(stmt, setp, number=10)
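
The tables above report each method's time relative to the fastest one per row. A minimal sketch of how the raw timeit results can be normalized and plotted (the exact plotting calls are my assumption; the figures above came from something similar):

# divide each row by its fastest entry, so the winner reads 1.0
rel_small = res_small_col.div(res_small_col.min(axis=1), axis=0)
rel_large = res_large_col.div(res_large_col.min(axis=1), axis=0)

# log-log plots, one per column count
rel_small.plot(loglog=True, title='10 columns')
rel_large.plot(loglog=True, title='100 columns')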
piRSquared
  • or `df.agg(lambda x: x.agg(['min', 'idxmin']), axis=1)` and no `.T` – rafaelc Aug 20 '18 at 14:12
  • Btw, `pandas 0.24.0` will allow doing `df.agg(['min', 'idxmin'], 1)` directly, check [here](https://github.com/pandas-dev/pandas/issues/16679) and [here](https://github.com/pandas-dev/pandas/pull/21224) ;} – rafaelc Aug 20 '18 at 14:21
  • Thanks @RafaelC, that is something to look forward to. I tried that instinctively with the version I'm using, but obviously it didn't work. – piRSquared Aug 20 '18 at 14:25
  • This answer is, generally speaking, much less efficient, especially if you don't have a lot of columns. Please check my answer, which includes timings. – toto_tico Aug 20 '18 at 16:35
  • I would have expected as much. I don't see that you've tested my second solution, which has been there for 3 hours. – piRSquared Aug 20 '18 at 17:37
  • @piRSquared, awesome results. I want to fix a bottleneck in my code, and the numpy answer will certainly help. Could you copy the numpy v2 answer at the beginning with a `df.assign` operation that creates the `min_column` and `min_value` columns (to keep it consistent with the example in the question)? (Please check my answer to see what I mean.) After that, I will mark this as the answer. Also, could you take a look at `test9`, the numpy `lookup` equivalent of pandas? I have trouble accepting that searches are faster than indexing; do you know a better way? – toto_tico Aug 21 '18 at 09:15

UPDATE 2:

The numpy solution by @piRSquared is the winner for what I would consider the most common cases. Here is his answer with a minimal modification to assign the columns to the original dataframe (which I did in all my tests, in order to be consistent with the example of the original question):

col_mask = df.columns.str.startswith('option')
options = df.columns[col_mask]
v = np.column_stack([*map(df.get, options)])
df.assign(min_value = np.nanmin(v, axis=1),
          min_column = options[np.nanargmin(v, axis=1)])

You should be careful if you have a lot of columns (more than 10000), since in these extreme cases the results could start changing significantly.

UPDATE 1:

According to my tests, calling min and idxmin separately is the fastest you can do, based on all the proposed answers.


Although it is not at the same time (see the direct answer below), you may be better off using DataFrame.lookup on the column indexes (the min_column column), in order to avoid searching for the values (min_value).

So, instead of traversing the entire matrix, which is O(n*m), you would only traverse the resulting min_column series, which is O(n):

df = pd.DataFrame({
    'id': [0,1,2,3,4], 
    'option_1': [10,     np.nan, np.nan, 400,    600], 
    'option_2': [np.nan, 20,     300,    np.nan, 700], 
    'option_3': [np.nan, 200,    30,     np.nan, 50],
    'option_4': [110,    np.nan, np.nan, 40,     50], 
})

df['min_column'] = df.filter(like='option').idxmin(1)
df['min_value'] = df.lookup(df.index, df['min_column'])
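
(A note on API longevity: DataFrame.lookup was deprecated in pandas 1.2 and later removed. A numpy-based equivalent, as a sketch that assumes every label in min_column is one of the option_ columns:)

opts = df.filter(like='option')
# positions of each row's min_column label among the option_ columns
pos = opts.columns.get_indexer(df['min_column'])
df['min_value'] = opts.values[np.arange(len(opts)), pos]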

Direct answer (not as efficient)

Since you asked about how to calculate the values "in the same call" (let's say because you simplified your example for the question), you can try a lambda expression:

def min_idxmin(x):
    _idx = x.idxmin()
    return _idx, x[_idx]

df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
    lambda x: min_idxmin(x), axis=1))

To be clear, although here the 2nd search is removed (replaced by a direct access in x[_idx]), this will very likely take much longer, because you are not exploiting the vectorization properties of pandas/numpy.

Bottom line: pandas/numpy vectorized operations are very fast.



Summary of the summary:

There doesn't seem to be any advantage in using df.lookup: calling min and idxmin separately is better than using the lookup, which is mind-blowing and deserves a question in itself.

Summary of the timings:

I tested a dataframe with 10000 rows and 10 columns (an option_ sequence as in the initial example). Since I got a couple of unexpected results, I then also tested with 1000x1000 and 100x10000. According to the results:

  1. Using numpy as @piRSquared suggested (test8) is the clear winner; it only starts performing worse when there are a lot of columns (100x10000, which does not argue against its general use). test9 modifies it to try to use indexing in numpy, but it generally speaking performs worse.

  2. Calling min and idxmin separately was the best for the 10000x10 case, even better than DataFrame.lookup (although the DataFrame.lookup result performed better in the 100x10000 case). Although the shape of the data influences the results, I would argue that having 10000 columns is a bit unrealistic.

  3. The solution provided by @Wen followed in performance, though it was not better than calling idxmin and min separately or using DataFrame.lookup. I did an extra test (see test7()) because I felt that the additional operations (reset_index and zip) might be disturbing the result. It was still worse than test1 and test2, even though it does not do the assignment (I couldn't figure out how to make the assignment using head(1)). @Wen, would you mind giving me a hand?

  4. @Wen's solution underperforms when there are more columns (1000x1000 or 100x10000), which makes sense because sorting is slower than searching. In this case, the lambda expression that I suggested performs better.

  5. Any other solution with a lambda expression, or one that uses the transpose (T), falls behind. The lambda expression that I suggested took around 1 second, better than the ~11 s using the transpose T suggested by @piRSquared and @RafaelC.

TimeIt results with 10000 rows x 10 columns (pandas 0.23.4):

Using the following dataframe of 10000 rows and 10 columns:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10000, 10)), columns=[f'option_{x}' for x in range(1,11)]).reset_index()
  1. Calling idxmin and min separately (one after the other):

    def test1():
        df['min_column'] = df.filter(like='option').idxmin(1)
        df['min_value'] = df.filter(like='option').min(1)
    %timeit -n 100 test1()
    # 13 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  2. Calling the lookup (it is slower for this case!):

    def test2():
        df['min_column'] = df.filter(like='option').idxmin(1)
        df['min_value'] = df.lookup(df.index, df['min_column'])    
    %timeit -n 100 test2()
    # 15.7 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  3. Using apply and min_idxmin(x):

    def min_idxmin(x):
        _idx = x.idxmin()
        return _idx, x[_idx]
    
    def test3():
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').apply(
            lambda x: min_idxmin(x), axis=1))
    %timeit -n 10 test3()
    # 968 ms ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
  4. Using agg['min', 'idxmin'] by @piRSquared:

    def test4():
        df['min_column'], df['min_value'] = zip(*df.set_index('index').filter(like='option').T.agg(['min', 'idxmin']).T.values)
    
    %timeit -n 1 test4()
    # 11.2 s ± 850 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  5. Using agg['min', 'idxmin'] by @RafaelC:

    def test5():
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').agg(lambda x: x.agg(['min', 'idxmin']), axis=1).values)

    %timeit -n 1 test5()
    # 11.7 s ± 597 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  6. Sorting values by @Wen:

    def test6():
        df['min_column'], df['min_value'] = zip(*df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1).values)
    
    %timeit -n 100 test6()
    # 33.6 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  7. Sorting values by @Wen, modified by me to make the comparison fairer given the overhead of the assignment operation (I explained why in the summary at the beginning):

    def test7():
        df.filter(like='option').stack().sort_values().groupby(level=[0]).head(1)
    
    %timeit -n 100 test7()
    # 25 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  8. Using numpy:

    def test8():
        col_mask = df.columns.str.startswith('option')
        options = df.columns[col_mask]
        v = np.column_stack([*map(df.get, options)])
        df.assign(min_value = np.nanmin(v, axis=1),
                  min_column = options[np.nanargmin(v, axis=1)])
    
    %timeit -n 100 test8()
    # 2.76 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  9. Using numpy but avoiding the search (indexing instead):

    def test9():
        col_mask = df.columns.str.startswith('option')
        options = df.columns[col_mask]
        v = np.column_stack([*map(df.get, options)])
        idxmin = np.nanargmin(v, axis=1)
        # instead of looking for the answer, indexes are used
        df.assign(min_value = v[range(v.shape[0]), idxmin],
                  min_column = options[idxmin])
    
    %timeit -n 100 test9()
    # 3.96 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

TimeIt results with 1000 rows x 1000 columns:

I performed more tests with a 1000x1000 shape:

df = pd.DataFrame(np.random.randint(0,100,size=(1000, 1000)), columns=[f'option_{x}' for x in range(1,1001)]).reset_index()

The results change:

test1    ~27.6ms
test2    ~29.4ms
test3    ~135ms
test4    ~1.18s
test5    ~1.29s
test6    ~287ms
test7    ~290ms
test8    ~25.7ms
test9    ~26.1ms

TimeIt results with 100 rows x 10000 columns:

I performed more tests with a 100x10000 shape:

df = pd.DataFrame(np.random.randint(0,100,size=(100, 10000)), columns=[f'option_{x}' for x in range(1,10001)]).reset_index()

The results change:

test1    ~46.8ms
test2    ~25.6ms
test3    ~101ms
test4    ~289ms
test5    ~276ms
test6    ~349ms
test7    ~301ms
test8    ~121ms
test9    ~122ms
toto_tico
  • I'm curious about whether you timed this? – roganjosh Aug 20 '18 at 14:04
  • It is a suggestion :) I can't test atm but I can see the logic of your question, I'm just curious whether the solution translates to a decent speed up – roganjosh Aug 20 '18 at 14:09
  • @roganjosh, I added some timings. Surprisingly, [`min` seems to be much faster than the `lookup`](https://stackoverflow.com/questions/51934952/why-is-df-lookup-slower-than-df-min) – toto_tico Aug 20 '18 at 16:37
  • Your version of my first function is not representative of the function. You are using a filter and zip and assignment into an existing dataframe. None of that was in my function. I have no issue with someone timing my functions but I do have an issue with an attempt to show timings that aren't representative. This can be misleading. – piRSquared Aug 20 '18 at 19:16
  • @piRSquared, I did that for everything that requires the assignment. It won't make a difference. For example, look at `test6` and `test7`: the overhead is well below `10ms` (from removing the assignment, the zip(*...) and the reset of the index). Your solutions are in the range of seconds; 10ms won't change that. Let me know if you still want to see the results without the assignments. Btw, the overhead of the zip is minimal, it is just a Python iterator, and all the other functions have the assignment (except test7, as explained in the answer). – toto_tico Aug 20 '18 at 19:42
  • That might be correct (I suspect you're right), but it is inconsistent. Plus, you never tested my second proposal, which is intended to be performant; my first proposal is intended to be succinct. Also, when you ask a question, you should state that performance is important: it has a significant impact on the direction of the answers. – piRSquared Aug 20 '18 at 19:47
  • @piRSquared, ah? it is bolded in the question. – toto_tico Aug 20 '18 at 19:48
  • That is fair. I misunderstood "efficient". It could be construed differently. But still, I admit it was there. – piRSquared Aug 20 '18 at 19:49
  • @piRSquared, I didn't test your numpy solution because I couldn't easily transform it into a function equivalent to the others. You are calling `nanmin` and `nanargmin`; I suspect that is what pandas `min` and `idxmin` call in the background, though I admit I might be wrong. Now that I think about it, after the weird behavior of the `lookup`, a version of your numpy answer might be faster. Do you know if there is a lookup equivalent in numpy? – toto_tico Aug 20 '18 at 19:51

Maybe using stack with groupby

v=df.filter(like='option')
v.stack().sort_values().groupby(level=[0]).head(1).reset_index(level=1)
Out[313]:
    level_1     0
0  option_1  10.0
1  option_2  20.0
2  option_3  30.0
3  option_4  40.0
4  option_3  50.0
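
To attach this to the original frame with the question's column names, a sketch (reusing v from above):

d = v.stack().sort_values().groupby(level=0).head(1).reset_index(level=1)
d.columns = ['min_column', 'min_value']
df.join(d)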
BENY
  • this answer is good enough if there are a few columns; yet, calling `min` and `idxmin` separately is better (see my timing test results). The lookup that I suggested is the 2nd best: theoretically it should be better, but for some reason it seems that [`lookup` is slower than `min`](https://stackoverflow.com/questions/51934952/why-is-df-lookup-slower-than-df-min) – toto_tico Aug 20 '18 at 16:34
  • @Wen, just to make sure, it was not me; I actually think your answer is quite interesting (I think it might be the best when one needs to do multiple searches using masks), and yeah, I hate that there is no explanation associated with downvotes... – toto_tico Aug 21 '18 at 07:27
  • @toto_tico I know man, I just noticed you have shared a lot of information, thanks buddy!! – BENY Aug 21 '18 at 14:36