Combining columns of dataframe

Question

I have a dataframe like this:

   c1   c2   c3
0   a   NaN  NaN
1  NaN   b   NaN
2  NaN  NaN   c
3  NaN   b   NaN
4   a   NaN  NaN

I want to combine these three columns into a single column that holds the non-null value from each row.

Here is the code to make the above data frame:

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan]
})

@cs95 I saw that ans too man and this is not exactly same !!! (∩︵∩) !!! — luckyCasualGuy, Jul 06 '20 at 07:54
Unfortunately it sort of is. Both questions call for collapsing non-null values into a single column. I didn't think to look for a duplicate until I started looking for the source to [Divakar's justify code](https://stackoverflow.com/questions/44558215/python-justifying-numpy-array/44559180#44559180). — cs95, Jul 06 '20 at 07:56
Nothing wrong with marking this question as duplicate - it is a _good_ thing, you are acting as a guidepost to other, more standard resources on the site. I am not "flagging" you, this is a privilege I am using as a member with more experience on the site :-) — cs95, Jul 06 '20 at 08:03
Additionally, you've already received answers to your _own_ question, which doesn't always happen for a question marked duplicate, so that's a good thing! Did either of those answers work for you? — cs95, Jul 06 '20 at 08:04
Others cannot provide me with more options if its marked duplicate — luckyCasualGuy, Jul 06 '20 at 08:05
If you have any issue with the existing answers, please leave a comment and we'd get back ASAP. As for _new_ options, please take it from me they'll rehash the answers in [this link](https://stackoverflow.com/questions/56583174/how-to-collapse-columns-in-pandas-on-null-values), so I see no point in reopening. Last word on this. — cs95, Jul 06 '20 at 08:06

MrNobody33 · Answer 1 · 2020-07-06T07:56:38.287

4

You could try this:

import pandas as pd
import numpy as np

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan]
})

# Replace NaN with '' so that summing across each row concatenates the strings.
newdf = pd.DataFrame({'c4': a.fillna('').values.sum(axis=1)})

Output:

newdf

  c4
0  a
1  b
2  c
3  b
4  a
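To see why this works, and where it stops working, here is a small sketch. The second frame, `two_values`, is a made-up example (not from the question) in which one row holds two non-null values; in that case the strings are concatenated rather than one of them being picked.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan],
})

# Summing an object array of strings along axis=1 applies `+` pairwise,
# so each row becomes the concatenation of its (possibly empty) strings.
combined = a.fillna('').values.sum(axis=1)

# Hypothetical frame: row 0 has two non-null values, row 1 has one.
two_values = pd.DataFrame({'c1': ['a', np.nan], 'c2': ['b', 'c']})
joined = two_values.fillna('').values.sum(axis=1)

print(list(combined))
print(list(joined))
```

So the approach is safe only when each row is known to contain at most one non-null value.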

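Another option, taken from jpp's answer, uses a list comprehension and takes advantage of the fact that np.nan != np.nan. It may well be the fastest way:

# `i == i` is False only for NaN, so this keeps the non-null values.
# The frame is flattened row by row, so the result lines up with the rows
# only when every row has exactly one non-null value.
newdf = pd.DataFrame({'c4': [i for row in a.values for i in row if i == i]})
print(newdf)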
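The self-comparison trick can be checked in isolation. The sketch below rebuilds the question's frame and filters it the same way; the variable names `values` and `nan_equals_itself` are only illustrative.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan],
})

# NaN is a float that compares unequal to everything, including itself.
nan_equals_itself = (np.nan == np.nan)

# Walk the frame row by row and keep every entry that equals itself.
values = [v for row in a.to_numpy(dtype=object) for v in row if v == v]

print(nan_equals_itself)
print(values)
```

Because the output is a flat list, a row with no values contributes nothing and a row with two values contributes two entries, so the length matches the number of rows only for data shaped like the question's.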

edited Jul 06 '20 at 07:56

answered Jul 06 '20 at 07:35

This is a good option if it is guaranteed for rows to have at most one non-null column. +1 – cs95 Jul 06 '20 at 07:37
Yeah, that's true, thanks for noticing @cs95 and for the upvote too! :) – MrNobody33 Jul 06 '20 at 07:59

cs95 · Accepted Answer · 2020-07-06T07:43:58.513

4

Backfilling along the rows and taking the first column is one option:

a.bfill(axis=1).iloc[:,0]

0    a
1    b
2    c
3    b
4    a
Name: c1, dtype: object
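As a self-contained sketch of the backfill approach: filling each row from right to left moves the first non-null value of every row into the first column. The frame `with_extras` is a hypothetical example added here to show what happens with two values in a row and with an all-NaN row.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan],
})

# Backfill across columns, then keep the first column and give it a new name.
c4 = a.bfill(axis=1).iloc[:, 0].rename('c4')

# Hypothetical frame: row 0 has two values, row 2 has none.
with_extras = pd.DataFrame({'c1': ['a', np.nan, np.nan],
                            'c2': ['b', 'c', np.nan]})
leftmost = with_extras.bfill(axis=1).iloc[:, 0]

print(c4.tolist())
print(leftmost.tolist())
```

Unlike the string-sum approach, a row with several values yields its leftmost one, and a row with none stays NaN.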

Another one is a simple stack, which gets rid of the NaNs.

a.stack().reset_index(level=1, drop=True)

0    a
1    b
2    c
3    b
4    a
dtype: object
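A runnable version of the stack approach might look like the sketch below. The extra `.dropna()` is a precaution added here, on the assumption that some pandas versions keep NaN entries when stacking; where stack already drops them it changes nothing.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan],
})

# stack() turns the frame into a Series indexed by (row label, column name).
# Dropping the column level leaves one value per original row label.
stacked = a.stack().dropna().reset_index(level=1, drop=True)

print(stacked.tolist())
print(list(stacked.index))
```

If a row held more than one non-null value, its label would appear more than once in the result, so this too relies on the one-value-per-row shape of the question's data.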

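Another interesting option you don't see every day is using the power of NumPy. This uses a modified version of [Divakar's justify utility](https://stackoverflow.com/questions/44558215/python-justifying-numpy-array/44559180#44559180) that works with object DataFrames.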

justify(a.to_numpy(), invalid_val=np.nan)[:,0]
# array(['a', 'b', 'c', 'b', 'a'], dtype=object)

# as a Series
pd.Series(justify(a.to_numpy(), invalid_val=np.nan)[:,0], index=a.index)

0    a
1    b
2    c
3    b
4    a
dtype: object
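The definition of `justify` is not shown above. As a rough idea of what such a helper does, here is a minimal left-justify sketch for object arrays. The function `justify_left` is a hypothetical stand-in written for illustration; it is not Divakar's code or the modified version this answer refers to, and it handles only the row-wise, left-aligned case.

```python
import numpy as np
import pandas as pd

def justify_left(arr, invalid_val=np.nan):
    """Push the non-null entries of each row to the left, padding with invalid_val."""
    mask = pd.notna(arr)  # True where a value is present
    # Sorting booleans puts False first; flipping moves the True slots to the left.
    justified_mask = np.flip(np.sort(mask, axis=1), axis=1)
    out = np.full(arr.shape, invalid_val, dtype=object)
    # Both masks select the same number of cells per row, in row-major order,
    # so the values keep their original left-to-right order within each row.
    out[justified_mask] = arr[mask]
    return out

a = pd.DataFrame({
    'c1': ['a', np.nan, np.nan, np.nan, 'a'],
    'c2': [np.nan, 'b', np.nan, 'b', np.nan],
    'c3': [np.nan, np.nan, 'c', np.nan, np.nan],
})

first = justify_left(a.to_numpy(dtype=object))[:, 0]
result = pd.Series(first, index=a.index)
print(result.tolist())
```

After justification the first column holds each row's leftmost non-null value, which is why slicing `[:, 0]` gives the combined column.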

edited Jul 06 '20 at 07:43

answered Jul 06 '20 at 07:37

stack is the better option IMO – sammywemmy Jul 06 '20 at 07:38
@sammywemmy you know, `stack` is notorious for not performing great but when up against an even more inefficient row wise backfill... perhaps the winner. – cs95 Jul 06 '20 at 07:39
I Agree ★⌒(●ゝω・)ｂTHX !!!! – luckyCasualGuy Jul 06 '20 at 07:41
