9

Given a Pandas DataFrame with lists stored in several of the columns, is there a simple way to find the column name which contains the longest list for each row?

For example, with this data:

                          positive                 negative          neutral
1   [marvel, moral, bold, destiny]                       []   [view, should]
2                      [beautiful]      [complicated, need]               []
3                      [celebrate]   [crippling, addiction]            [big]

I want to identify "positive" as the column with the longest list for row 1 and "negative" for rows 2 and 3.

I thought I could use str.len() to calculate the list lengths and idmax() to get the column names, but can't figure out how to combine them.

user2950747
  • 695
  • 1
  • 6
  • 19

3 Answers3

15

IIUC:

In [227]: df.applymap(len).idxmax(axis=1)
Out[227]:
0    positive
1    negative
2    negative
dtype: object
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Many thanks for this! Quick follow-up: `idxmax()` returns the first index (ie 'positive') when 'positive', 'negative' and 'neutral' are all empty lists. Could this solution be modified to ignore such rows? I've tried replacing the empty lists with NaNs, but `len` then complains: `"object of type 'float' has no len()"`. – user2950747 Aug 23 '17 at 13:30
  • @user2950747, could you provide a reproducible data set? – MaxU - stand with Ukraine Aug 23 '17 at 16:46
  • [this sample data](http://www.sharecsv.com/s/9dcf95cd39fa09eec291085eafecbd0c/polarity.csv) has empty lists in the 3rd row from the end, and would be labelled 'positive' by `idxmax()`. – user2950747 Aug 23 '17 at 18:04
  • @user2950747, do you simply want to drop such rows? – MaxU - stand with Ukraine Aug 23 '17 at 18:06
  • ah... Yes! I wasn't thinking of it like that, but that would be the simplest approach. Thanks! – user2950747 Aug 23 '17 at 18:08
5
>>> df.apply(lambda row: row.apply(len).argmax(), axis=1)
0    positive
1    negative
2    negative
dtype: object
Alexander
  • 105,104
  • 32
  • 201
  • 196
2

Or you can try this ...

df=df.reset_index()
DF=pd.melt(df,id_vars=['index'])
DF['Length']=DF['value'].apply(lambda x : len(x))
DF.sort_values(['index','Length']).drop_duplicates(subset=['index'],keep='last')
BENY
  • 317,841
  • 20
  • 164
  • 234