
My question is similar to 'Combine pandas string columns with missing values' and 'How to merge/combine columns in pandas?', but because the cells I want to combine contain lists, none of the given answers work.

Simplified, my df looks like this:

        players    players1    players2    players3
1       ['1','2']  
2       ['2','4']
3                  ['1','4']
4                              ['1','5']
5                                          ['3','5']
6
7                  ['3','4']

(So I know there will never be values in more than one of the columns for the same row.)

Now I want to add a new column combining the lists:

        players    players1    players2    players3     players_combine
1       ['1','2']                                       ['1','2']
2       ['2','4']                                       ['2','4']
3                  ['1','4']                            ['1','4']
4                              ['1','5']                ['1','5']
5                                          ['3','5']    ['3','5']
6
7                  ['3','4']                            ['3','4']

I tried many things - mainly variations of the linked answers. My last idea was a cascading np.where, but it didn't work: only the values from 'players' showed up in the combined column.

df['players_combine'] = np.where(
    df.players.notnull(), df.players.values,
    np.where(df.players1.notnull(), df.players1.values,
             np.where(df.players2.notnull(), df.players2.values,
                      np.where(df.players3.notnull(), df.players3.values, np.nan))))

EDIT: As asked in the comments, here is df.head(5).to_dict():

{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'], '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'], '5b41800daffb061b88c4bc7f': 'nan', '5b41800eaffb061b88c4bd62': 'nan', '5b41800eaffb061b88c4bd65': 'nan'}, 'players1': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'], '5b41800eaffb061b88c4bd62': '', '5b41800eaffb061b88c4bd65': ''}, 'players2': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'], '5b41800eaffb061b88c4bd65': ''}, 'players3': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': nan, '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
– J_Scholz

4 Answers


If any of the empty cells are empty strings (''), first convert them to NaN:

df[df==''] = np.nan
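Equivalently, since only exact matches need to be caught here, replace does the same thing (a stylistic alternative, not part of the original answer):

df = df.replace('', np.nan)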

Then select the maximum of all non-NaNs in each row. Since at most one value per row is non-null, the "maximum" is simply that value (and an all-NaN row yields NaN):

df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#6       NaN
#7    [3, 4]

Another interesting (and faster) solution is to eliminate the all-NaN rows and then take the first valid value in each row (note that the all-NaN row 6 is dropped from the result):

df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#7    [3, 4]
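If you want a players_combine column aligned with the original frame despite the dropped row, a small sketch (my addition, relying on pandas' index alignment during assignment):

df['players_combine'] = df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)
# Row 6 ends up as NaN in the new column, because pandas aligns the
# shorter result on the index when assigning.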
– DYZ

Since you know there will be a value in at most one column for each row, you can replace the values you don't care about with NaN and then use .stack().

In this case it looks like you have both 'nan' strings and empty strings '', which should be replaced with np.NaN first.

import numpy as np

df['players_combine'] = df.replace({'': np.NaN, 'nan': np.NaN}, regex=True).stack().reset_index(level=1, drop=True)

Sample Data

import pandas as pd
df = pd.DataFrame({'players': [['1','2'], '', '', np.NaN, ''],
                   'players1': ['', ['2','4'], '', np.NaN, ''],
                   'players2': ['', '', ['1','5'], np.NaN, ''],
                   'players3': ['', '', np.NaN, ['3', '5'], '']})

Output:

  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
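To see why this works, it may help to inspect the intermediate result: .stack() drops NaN cells by default and returns a Series with a (row, column) MultiIndex, and reset_index(level=1, drop=True) discards the column level so the values realign with the original rows. A sketch using the sample data above:

stacked = df.replace({'': np.NaN, 'nan': np.NaN}, regex=True).stack()
print(stacked)
# 0  players     [1, 2]
# 1  players1    [2, 4]
# 2  players2    [1, 5]
# 3  players3    [3, 5]
# dtype: object    (output approximate; the all-NaN row 4 is gone)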
– ALollz

Analyzing your df.to_dict(), it seems that some cells contain the string 'nan', others contain the actual np.nan, and some others even contain empty strings ''.

So first clean your data set by making the null values uniform:

df = df.replace({'nan':np.nan, '':np.nan})

Then you can aggregate along axis=1:

df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
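Two caveats worth flagging: s[...][0] relies on falling back to positional access on a labeled Series, which recent pandas versions deprecate, and an all-NaN row would raise because the filtered Series is empty. A more defensive sketch (the .iloc and the np.nan fallback are my additions):

df['players_combine'] = df.agg(
    lambda s: s.dropna().iloc[0] if s.notna().any() else np.nan, axis=1)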
– rafaelc
  So it looks like the 'nan' strings were my problem all along. I mistook them for real np.nan. Now my np.where line works. – J_Scholz Sep 17 '18 at 21:37

As badly formatted data was the problem in my code, not the question asked, I wanted to add some information about the performance of the different options. I used np.where because it operates on whole arrays at once (vectorization), whereas .apply iterates through the rows, so there is a huge performance difference.


Setting up a test df with 4 * 2000 = 8000 rows:

import pandas as pd
import numpy as np

l = [[['1','2'], np.NaN, np.NaN, np.NaN],
     [np.NaN, ['2','3'], np.NaN, np.NaN],
     [np.NaN, np.NaN, ['3','4'], np.NaN],
     [np.NaN, np.NaN, np.NaN, ['4','5']]]

l = l * 2000

df = pd.DataFrame(l, columns=['players', 'players1', 'players2', 'players3'])

Best option:

%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))

100 loops, best of 3: 2.18 ms per loop

Very good option:

%timeit df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)

100 loops, best of 3: 413 ms per loop

Other option 1:

%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)

1 loop, best of 3: 4.71 s per loop

Other option 2:

%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)

1 loop, best of 3: 4.86 s per loop
– J_Scholz