
My question is similar to 'Combine pandas string columns with missing values' and 'How to merge/combine columns in pandas?', but because the cells I want to combine contain lists, none of the given answers work.

Simplified, my df looks like this:

        players    players1    players2    players3
1       ['1','2']  
2       ['2','4']
3                  ['1','4']
4                              ['1','5']
5                                          ['3','5']
6
7                  ['3','4']

(So I know there will never be values in more than one of the columns for the same row.)

Now I want to add a new column combining the lists:

        players    players1    players2    players3     players_combine
1       ['1','2']                                       ['1','2']
2       ['2','4']                                       ['2','4']
3                  ['1','4']                            ['1','4']
4                              ['1','5']                ['1','5']
5                                          ['3','5']    ['3','5']
6
7                  ['3','4']                            ['3','4']

I tried many things - mainly variations of the linked answers. My last idea was a cascading np.where, but it didn't work: only the values from 'players' showed up in the combined column.

df['players_combine'] = np.where(
    df.players.notnull(), df.players.values,
    np.where(df.players1.notnull(), df.players1.values,
             np.where(df.players2.notnull(), df.players2.values,
                      np.where(df.players3.notnull(), df.players3.values, np.nan))))

EDIT: As asked in the comments, here is df.head(5).to_dict():

{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'], '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'], '5b41800daffb061b88c4bc7f': 'nan', '5b41800eaffb061b88c4bd62': 'nan', '5b41800eaffb061b88c4bd65': 'nan'}, 'players1': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'], '5b41800eaffb061b88c4bd62': '', '5b41800eaffb061b88c4bd65': ''}, 'players2': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'], '5b41800eaffb061b88c4bd65': ''}, 'players3': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': nan, '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
– J_Scholz

4 Answers


If any of the empty cells are empty strings (''), first convert them to NaN:

df[df==''] = np.nan
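Equivalently, since only exact matches need to be caught here, replace does the same thing (a stylistic alternative, not part of the original answer):

df = df.replace('', np.nan)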

Then select the maximum of all non-NaNs in each row. Since at most one value per row is non-null, the "maximum" is simply that value (and an all-NaN row yields NaN):

df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#6       NaN
#7    [3, 4]

Another interesting (and faster) solution is to eliminate the all-NaN rows and then take the first valid value in each row (note that the all-NaN row 6 is dropped from the result):

df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#7    [3, 4]
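If you want a players_combine column aligned with the original frame despite the dropped row, a small sketch (my addition, relying on pandas' index alignment during assignment):

df['players_combine'] = df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)
# Row 6 ends up as NaN in the new column, because pandas aligns the
# shorter result on the index when assigning.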
– DYZ

Since you know there will be a value in at most one column for each row, you can replace the values you don't care about with NaN and then use .stack().

In this case it looks like you have both 'nan' strings and empty strings '', which should be replaced with np.NaN first.

import numpy as np

df['players_combine'] = df.replace({'': np.NaN, 'nan': np.NaN}, regex=True).stack().reset_index(level=1, drop=True)

Sample Data

import pandas as pd
df = pd.DataFrame({'players': [['1','2'], '', '', np.NaN, ''],
                   'players1': ['', ['2','4'], '', np.NaN, ''],
                   'players2': ['', '', ['1','5'], np.NaN, ''],
                   'players3': ['', '', np.NaN, ['3', '5'], '']})

Output:

  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
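To see why this works, it may help to inspect the intermediate result: .stack() drops NaN cells by default and returns a Series with a (row, column) MultiIndex, and reset_index(level=1, drop=True) discards the column level so the values realign with the original rows. A sketch using the sample data above:

stacked = df.replace({'': np.NaN, 'nan': np.NaN}, regex=True).stack()
print(stacked)
# 0  players     [1, 2]
# 1  players1    [2, 4]
# 2  players2    [1, 5]
# 3  players3    [3, 5]
# dtype: object    (output approximate; the all-NaN row 4 is gone)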
– ALollz

Analyzing your df.to_dict(), it seems that some cells contain the string 'nan', others contain the actual np.nan, and some others even contain empty strings ''.

So first clean your data set by making the null values uniform:

df = df.replace({'nan':np.nan, '':np.nan})

Then you can aggregate along axis=1:

df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
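Two caveats worth flagging: s[...][0] relies on falling back to positional access on a labeled Series, which recent pandas versions deprecate, and an all-NaN row would raise because the filtered Series is empty. A more defensive sketch (the .iloc and the np.nan fallback are my additions):

df['players_combine'] = df.agg(
    lambda s: s.dropna().iloc[0] if s.notna().any() else np.nan, axis=1)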
– rafaelc
  So it looks like the 'nan' strings were my problem all along. I mistook them for real np.nan. Now my np.where line works. – J_Scholz Sep 17 '18 at 21:37

As badly formatted data was the problem in my code, not the question asked, I wanted to add some information about the performance of the different options. I used np.where because it operates on whole arrays at once (vectorization), whereas .apply iterates through the rows, so there is a huge performance difference.


Setting up a test df with 4 * 2000 = 8000 rows:

import pandas as pd
import numpy as np

l = [[['1','2'], np.NaN, np.NaN, np.NaN],
     [np.NaN, ['2','3'], np.NaN, np.NaN],
     [np.NaN, np.NaN, ['3','4'], np.NaN],
     [np.NaN, np.NaN, np.NaN, ['4','5']]]

l = l * 2000

df = pd.DataFrame(l, columns=['players', 'players1', 'players2', 'players3'])

Best option:

%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))

100 loops, best of 3: 2.18 ms per loop

Very good option:

%timeit df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)

100 loops, best of 3: 413 ms per loop

Other option 1:

%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)

1 loop, best of 3: 4.71 s per loop

Other option 2:

%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)

1 loop, best of 3: 4.86 s per loop
– J_Scholz