Efficient (& perhaps shortest) way of getting a list of column names that satisfies a condition, for each row

Question

I have a dataframe with 5 columns (first column is ID and 4 are country names)

I want a list of country names for each row, that satisfies a particular condition on that respective country column.

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

In the above case I need a IDs for which rating is 4 and above.

I'm expecting the output to be list of companies for where rating was 4 and above for each ID. Can be a dataframe or a dict.

highest_rated_companies = { 'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2'] }

kindly share data not pics, in this case your input dataframe — sammywemmy, Jun 23 '20 at 06:06
What have you tried? Where did it go wrong? Is the output to be understood as a dictionary as you wrote it or implicitly as a data frame again? — David Wierichs, Jun 23 '20 at 06:06
@DavidWierichs The output can be a dictionary or a dataframe. I'm looking for a short code. I got it done using 2 'for loops', but that's not efficient. — Salih, Jun 23 '20 at 06:33

MrNobody33 · Accepted Answer · 2020-06-23T08:10:08.017

You could try something like this, with a to_records, that seems to be the fastest as you can see here:

First Option

import pandas as pd
import numpy as np
data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[1]:[df.columns[idx] for idx,val in enumerate(list(row)[2:],1) if val>=4] for row in df.to_records()}

Second Option

import pandas as pd

data = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(data)
print(df)
highest_rated_companies={row[0]:[df.columns[idx] for idx,val in enumerate(row[1:],1) if val>=4] for i, row in df.iterrows()}
print(highest_rated_companies)

Both outputs:

df:

   id  c1  c2  c3  c4
0  i1   3   4   4   5
1  i2   2   2   5   1
2  i3   4   5   3   2
3  i4   1   5   3   2
4  i5   4   5   3   2


highest_rated_companies:

{'i1': ['c2', 'c3', 'c4'], 'i2': ['c3'], 'i3': ['c1', 'c2'], 'i4': ['c2'], 'i5': ['c1', 'c2']}

Timestamps:

First Option:

0.0113047 seconds best case, when executed 100 times the script
1.2424291999999468 seconds best case, when executed 10000 times the script

Second Option

0.07292359999996734 seconds best case, when executed 100 times the script
7.821904700000005 seconds best case, when executed 10000 times the script

Edit:

Using dt.to_records(), seem to be the fastest way, since I tested Ehsan's answer and I got when executed 10000 times the script, a timestamp of 50.64001639999992 seconds, and when executed 100 times the script, a timestamp of 0.5399872999998934 seconds. Even it's faster than the Second Option, the First Option keep being the fastest.

score 0 · Answer 2 · answered Jun 23 '20 at 07:50

You can do this:

df = pd.DataFrame(df)
keys, values = np.where(df[['c1','c2','c3','c4']].ge(4))
highest_rated_companies = pd.DataFrame({'id':df.iloc[keys].id,'c':df.columns[values+1]})

output:

   id   c
0  i1  c2
0  i1  c3
0  i1  c4
1  i2  c3
2  i3  c1
2  i3  c2
3  i4  c2
4  i5  c1
4  i5  c2

You can easily convert it to a dictionary if you prefer.

score 0 · Answer 3 · answered Jun 23 '20 at 15:45

Another option is to use the to_dict method. If you set your id column as the index, you can do:

df = df[df>=4]
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

The last line is to convert the dictionary into the desired format. Time test:

In[0]:
import pandas as pd

df = {'id':['i1','i2','i3','i4','i5'], 'c1':[3,2,4,1,4], 'c2':[4,2,5,5,5], 'c3':[4,5,3,3,3], 'c4':[5,1,2,2,2]}

df = pd.DataFrame(df)
df = df.set_index('id',drop=True)
df = df[df>=4]

%%timeit -n 1000
d = df.to_dict('index')
output = {ID: [name for name,val in row.items() if not pd.isnull(val)] for ID, row in d.items()}

Out[0]
243 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This isn't as fast as what @MrNobody33 answered, though: 135 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Efficient (& perhaps shortest) way of getting a list of column names that satisfies a condition, for each row

3 Answers3