
I'm looking for a method that behaves similarly to COALESCE in T-SQL. I have two columns (A and B) that are sparsely populated in a pandas DataFrame. I'd like to create a new column using the following rules:

  1. If the value in column A is not null, use that value for the new column C
  2. If the value in column A is null, use the value in column B for the new column C

Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?

  • is there a non-pandas answer to the same question but for numpy arrays or dict where if A is None, then take B from same array index. equivalent to SQL `coalesce(A, B)` for NULLs or `dplyr::coalesce(A, B)` in R. – Brian D Aug 03 '22 at 16:46
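A minimal NumPy sketch for the comment above (my illustration, not from any answer here; it assumes float arrays where missing values are NaN, since np.isnan only works on float dtypes):

import numpy as np

a = np.array([1.0, np.nan, 3.0, np.nan])
b = np.array([9.0, 2.0, 9.0, 4.0])

# element-wise coalesce: take a where it is not NaN, otherwise take b
c = np.where(np.isnan(a), b, a)
print(c)  # [1. 2. 3. 4.]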

9 Answers


Use combine_first():

In [15]: import numpy as np; import pandas as pd

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0

Coalesce for multiple columns with DataFrame.bfill

All these methods work for two columns and are fine for maybe three, but they all require method chaining once you have n > 2 columns:

Example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})

print(df)

   col1  col2  col3  col4
0   NaN   NaN   2.0   NaN
1   2.0   5.0   NaN  10.0
2   4.0   1.0   9.0  11.0
3   5.0   0.0   1.0   4.0
4   NaN   NaN   NaN   8.0

Using DataFrame.bfill over the columns axis (axis=1), we get the first non-null value per row in a generalized way, even for a large number of columns.

Plus, this also works for string-type columns (see the sketch after the output below)!

df['coalesce'] = df.bfill(axis=1).iloc[:, 0]

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
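A quick sketch of the string case (my example, not from the original answer):

df_s = pd.DataFrame({'x': [None, 'foo'], 'y': ['bar', None]})
df_s['coalesce'] = df_s.bfill(axis=1).iloc[:, 0]
print(df_s['coalesce'].tolist())  # ['bar', 'foo']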

Using Series.combine_first (the accepted answer) gets quite cumbersome and eventually becomes unworkable as the number of columns grows:

df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
        .combine_first(df['col3'])
        .combine_first(df['col4'])
)

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
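If you still want combine_first semantics with many columns, a reduce over the column list (my generalization, not part of the original answer) avoids writing the chain by hand:

from functools import reduce

# priority order: earlier columns win
cols = ['col1', 'col2', 'col3', 'col4']
df['coalesce'] = reduce(lambda s, c: s.combine_first(df[c]), cols[1:], df['col1'])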
  • If you need the coalesce in a specific column order, simply select the columns you need, as in `df['coalesce'] = df[['col4', 'col1', 'col2', 'col3']].bfill(axis=1).iloc[:, 0]` – natbusa Aug 27 '20 at 17:49
    In my initial testing with `timeit` on a dataset with 250k rows, just coalescing two columns: the `bfill()` method takes 1000x longer than the `combine_first()` method (20sec vs 20ms). – johnnybarrels Jun 20 '22 at 06:52
  • @johnnybarrels Well, that's disappointing. Pandas should internally do the same thing using `bfill` as `combine_first`. – wisbucky Aug 03 '22 at 16:25
  • There is a typo in this answer. axis=1 is the columns axis whereas axis=0 is the index axis. – michen00 Aug 22 '22 at 17:47
  • You're right, thanks for spotting @michen00, edited accordingly – Erfan Aug 22 '22 at 19:21

Try this also; it's easier to remember:

df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )

This is slightly faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )

%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop


%timeit  df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
  • I wonder why adding the explicit `== True` makes this faster?!? Feels counter-intuitive. (But I'm learning not to try and intuit _anything_ where Pandas is concerned...) – dsz Aug 22 '22 at 22:19
  • It might not be true in newer versions, it's possible that older versions did some kind of unnecessary extra conversion. Note that this answer is from 2016! – shadowtalker Aug 19 '23 at 15:51

combine_first is the most straightforward option, but it isn't the only one. Below, I outline a few more solutions, some applicable to different cases.

Case #1: Non-mutually Exclusive NaNs

Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})      
df

     a    b
0  1.0  5.0
1  2.0  3.0
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  7.0  6.0
6  NaN  7.0

Let's combine first on a.

Series.mask

df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

Series.where

df['a'].where(pd.notnull, df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

You can use similar syntax using np.where.
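For example, a minimal sketch (same result as the mask/where calls above):

np.where(df['a'].isnull(), df['b'], df['a'])
# array([1., 2., 3., 4., 5., 7., 7.])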

Alternatively, to combine first on b, switch the conditions around.


Case #2: Mutually Exclusive Positioned NaNs

All rows have NaNs which are mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df

     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  NaN  6.0
6  NaN  7.0

Series.update

This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.

df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  NaN  4.0
4  5.0  5.0
5  NaN  6.0
6  NaN  7.0

Series.add

df['a'].add(df['b'], fill_value=0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

DataFrame.fillna + DataFrame.sum

df.fillna(0).sum(axis=1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:

Build dummy data

import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

df
    a1   a2   a3    a4   b1
0  NaN  2.0  4.0  None    9
1  2.0  NaN  5.0  None    9
2  3.0  4.0  NaN  None    9
3  NaN  NaN  NaN  None  999

Coalesce a1, a2, a3 into a new column A:

def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)

print(df)
    a1   a2   a3    a4   b1    A
0  NaN  2.0  4.0  None    9  2.0
1  2.0  NaN  5.0  None    9  2.0
2  3.0  4.0  NaN  None    9  3.0
3  NaN  NaN  NaN  None  999  NaN
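Note that a row-wise apply can be slow on large DataFrames; the DataFrame.bfill trick from the earlier answer gives a vectorized equivalent over the same column subset:

df['A'] = df[cols_to_search].bfill(axis=1).iloc[:, 0]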

I'm thinking of a solution like this,

def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:

df['d'] = coalesce(df.a, df.b, df.c)

For a more general case, where there are no NaNs but you want the same behavior:

Merge 'left', but override 'right' values where possible
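A rough sketch of that idea (the frames and key names here are illustrative, not from the linked answer): merge left with right, then prefer the right-hand value wherever the merge found a match.

import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3], 'val': ['B', 'C']})

# left join keeps every left row; val_right is NaN where right had no match
merged = left.merge(right, on='key', how='left', suffixes=('_left', '_right'))

# coalesce: right value where present, else the original left value
merged['val'] = merged['val_right'].combine_first(merged['val_left'])
print(merged[['key', 'val']])  # val becomes a, B, C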


Good code, but you have a typo for Python 3; the correct version looks like this:

    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html

Sample Dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5],
                   'B': [np.nan, 2, 3, 4, np.nan]})

Coalesce using DuckDB:

import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)

Output:

     A    B    C
0  1.0  NaN  1.0
1  NaN  2.0  2.0
2  3.0  3.0  3.0
3  4.0  4.0  4.0
4  5.0  NaN  5.0