
I'm looking for a method that behaves similarly to COALESCE in T-SQL. I have two columns (A and B) that are sparsely populated in a pandas DataFrame. I'd like to create a new column using the following rules:

  1. If the value in column A is not null, use that value for the new column C
  2. If the value in column A is null, use the value in column B for the new column C

Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?

  • is there a non-pandas answer to the same question but for numpy arrays or dict where if A is None, then take B from same array index. equivalent to SQL `coalesce(A, B)` for NULLs or `dplyr::coalesce(A, B)` in R. – Brian D Aug 03 '22 at 16:46
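A minimal NumPy sketch for the comment above (my illustration, not from any answer here; it assumes float arrays where missing values are NaN, since np.isnan only works on float dtypes):

import numpy as np

a = np.array([1.0, np.nan, 3.0, np.nan])
b = np.array([9.0, 2.0, 9.0, 4.0])

# element-wise coalesce: take a where it is not NaN, otherwise take b
c = np.where(np.isnan(a), b, a)
print(c)  # [1. 2. 3. 4.]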

9 Answers


Use combine_first():

In [15]: import numpy as np; import pandas as pd

In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))

In [17]: df.loc[::2, 'a'] = np.nan

In [18]: df
Out[18]:
     a  b
0  NaN  0
1  5.0  5
2  NaN  8
3  2.0  8
4  NaN  3
5  9.0  4
6  NaN  7
7  2.0  0
8  NaN  6
9  2.0  5

In [19]: df['c'] = df.a.combine_first(df.b)

In [20]: df
Out[20]:
     a  b    c
0  NaN  0  0.0
1  5.0  5  5.0
2  NaN  8  8.0
3  2.0  8  2.0
4  NaN  3  3.0
5  9.0  4  9.0
6  NaN  7  7.0
7  2.0  0  2.0
8  NaN  6  6.0
9  2.0  5  2.0

Coalesce for multiple columns with DataFrame.bfill

All these methods work for two columns and are fine for maybe three, but they all require method chaining once you have n > 2 columns:

Example dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 2, 4, 5, np.nan],
                   'col2': [np.nan, 5, 1, 0, np.nan],
                   'col3': [2, np.nan, 9, 1, np.nan],
                   'col4': [np.nan, 10, 11, 4, 8]})

print(df)

   col1  col2  col3  col4
0   NaN   NaN   2.0   NaN
1   2.0   5.0   NaN  10.0
2   4.0   1.0   9.0  11.0
3   5.0   0.0   1.0   4.0
4   NaN   NaN   NaN   8.0

Using DataFrame.bfill over the columns axis (axis=1), we get the first non-null value per row in a generalized way, even for a large number of columns.

Plus, this also works for string-type columns (see the sketch after the output below)!

df['coalesce'] = df.bfill(axis=1).iloc[:, 0]

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
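A quick sketch of the string case (my example, not from the original answer):

df_s = pd.DataFrame({'x': [None, 'foo'], 'y': ['bar', None]})
df_s['coalesce'] = df_s.bfill(axis=1).iloc[:, 0]
print(df_s['coalesce'].tolist())  # ['bar', 'foo']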

Using Series.combine_first (the accepted answer) gets quite cumbersome and eventually becomes unworkable as the number of columns grows:

df['coalesce'] = (
    df['col1'].combine_first(df['col2'])
        .combine_first(df['col3'])
        .combine_first(df['col4'])
)

   col1  col2  col3  col4  coalesce
0   NaN   NaN   2.0   NaN       2.0
1   2.0   5.0   NaN  10.0       2.0
2   4.0   1.0   9.0  11.0       4.0
3   5.0   0.0   1.0   4.0       5.0
4   NaN   NaN   NaN   8.0       8.0
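If you still want combine_first semantics with many columns, a reduce over the column list (my generalization, not part of the original answer) avoids writing the chain by hand:

from functools import reduce

# priority order: earlier columns win
cols = ['col1', 'col2', 'col3', 'col4']
df['coalesce'] = reduce(lambda s, c: s.combine_first(df[c]), cols[1:], df['col1'])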
  • If you need the coalesce in a specific column order, simply select the columns you need, as in `df['coalesce'] = df[['col4', 'col1', 'col2', 'col3']].bfill(axis=1).iloc[:, 0]` – natbusa Aug 27 '20 at 17:49
    In my initial testing with `timeit` on a dataset with 250k rows, just coalescing two columns: the `bfill()` method takes 1000x longer than the `combine_first()` method (20sec vs 20ms). – johnnybarrels Jun 20 '22 at 06:52
  • @johnnybarrels Well, that's disappointing. Pandas should internally do the same thing using `bfill` as `combine_first`. – wisbucky Aug 03 '22 at 16:25
  • There is a typo in this answer. axis=1 is the columns axis whereas axis=0 is the index axis. – michen00 Aug 22 '22 at 17:47
  • You're right, thanks for spotting @michen00, edited accordingly – Erfan Aug 22 '22 at 19:21

Try this also; it's easier to remember:

df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )

This is slightly faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )

%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop


%timeit  df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
  • I wonder why adding the explicit `== True` makes this faster?!? Feels counter-intuitive. (But I'm learning not to try and intuit _anything_ where Pandas is concerned...) – dsz Aug 22 '22 at 22:19
  • It might not be true in newer versions, it's possible that older versions did some kind of unnecessary extra conversion. Note that this answer is from 2016! – shadowtalker Aug 19 '23 at 15:51

combine_first is the most straightforward option, but it isn't the only one. Below, I outline a few more solutions, some applicable to different cases.

Case #1: Non-mutually Exclusive NaNs

Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
    'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})      
df

     a    b
0  1.0  5.0
1  2.0  3.0
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  7.0  6.0
6  NaN  7.0

Let's combine first on a.

Series.mask

df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

Series.where

df['a'].where(pd.notnull, df['b'])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    7.0
6    7.0
Name: a, dtype: float64

You can use similar syntax using np.where.
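For example, a minimal sketch (same result as the mask/where calls above):

np.where(df['a'].isnull(), df['b'], df['a'])
# array([1., 2., 3., 4., 5., 7., 7.])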

Alternatively, to combine first on b, switch the conditions around.


Case #2: Mutually Exclusive Positioned NaNs

All rows have NaNs which are mutually exclusive between columns.

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df

     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
3  NaN  4.0
4  5.0  NaN
5  NaN  6.0
6  NaN  7.0

Series.update

This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.

df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  NaN  4.0
4  5.0  5.0
5  NaN  6.0
6  NaN  7.0

Series.add

df['a'].add(df['b'], fill_value=0)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

DataFrame.fillna + DataFrame.sum

df.fillna(0).sum(axis=1)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
dtype: float64

I encountered this problem too, but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:

Build dummy data

import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

df
    a1   a2   a3    a4   b1
0  NaN  2.0  4.0  None    9
1  2.0  NaN  5.0  None    9
2  3.0  4.0  NaN  None    9
3  NaN  NaN  NaN  None  999

Coalesce a1, a2, a3 into a new column A:

def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)

print(df)
    a1   a2   a3    a4   b1    A
0  NaN  2.0  4.0  None    9  2.0
1  2.0  NaN  5.0  None    9  2.0
2  3.0  4.0  NaN  None    9  3.0
3  NaN  NaN  NaN  None  999  NaN
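Note that a row-wise apply can be slow on large DataFrames; the DataFrame.bfill trick from the earlier answer gives a vectorized equivalent over the same column subset:

df['A'] = df[cols_to_search].bfill(axis=1).iloc[:, 0]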

I'm thinking of a solution like this,

def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:

df['d'] = coalesce(df.a, df.b, df.c)

For a more general case, where there are no NaNs but you want the same behavior:

Merge 'left', but override 'right' values where possible
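A rough sketch of that idea (the frames and key names here are illustrative, not from the linked answer): merge left with right, then prefer the right-hand value wherever the merge found a match.

import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3], 'val': ['B', 'C']})

# left join keeps every left row; val_right is NaN where right had no match
merged = left.merge(right, on='key', how='left', suffixes=('_left', '_right'))

# coalesce: right value where present, else the original left value
merged['val'] = merged['val_right'].combine_first(merged['val_left'])
print(merged[['key', 'val']])  # val becomes a, B, C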


Good code, but you have a typo for Python 3; the correct version looks like this:

    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)        
    return s

Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html

Sample Dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4, 5],
                   'B': [np.nan, 2, 3, 4, np.nan]})

Coalesce using DuckDB:

import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)

Output:

     A    B    C
0  1.0  NaN  1.0
1  NaN  2.0  2.0
2  3.0  3.0  3.0
3  4.0  4.0  4.0
4  5.0  NaN  5.0