
I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.

import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']})

I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
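
For reference, the default pd.merge performs an inner join on the shared names column, which is why only the four matching rows survive; roughly:

combined = pd.merge(names, info)
print(combined)
#      names position classification
# 0    frank      dev          thief
# 1      tim      sys           good
# 2     mark      sup          thief
# 3      joe      sup          thief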

Ideally, I would have the values in those missing columns set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.

EDIT: One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:

names = pd.DataFrame({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
                     'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")

The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where the position is "devsys". Finally, even though bill does not appear in the names dataframe, he shows up in the resulting dataframe. So I really need a way to look up a value in the other dataframe and, if something is found, tack on those columns.

cs95
Kevin Thompson

4 Answers


In case you are still looking for an answer for this:

The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:

import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'], 
                      'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})

info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
                     'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)

which will result in:

      names position classification
0       bob      dev        unknown
1       bob      dev        unknown
2       bob      dev        unknown
3       bob      dev        unknown
4     frank      dev          thief
5     james      dev        unknown
6       tim      sys           good
7   ricardo      sys        unknown
8      mike      sys        unknown
9      mark      sup          thief
10     joan      sup        unknown
11      joe      sup          thief
12      joe      sup           good
13     bill  unknown          thief
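
As a side note on the "bobjames" and "devsys" rows in the original attempt: Python concatenates adjacent string literals, so the missing comma silently merges the two values into one string. For example:

>>> 'bob''james'
'bobjames'
>>> 'dev''sys'
'devsys'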
oxtay
  • Hey, I have a slightly more complicated case. What if `info` already contains NaN values, and you want to leave them as NaN, but only change the missing values? – Aleksejs Fomins Oct 04 '22 at 14:17
  • @AleksejsFomins I don't think I understood what you mean. Can you give an example? – oxtay Oct 04 '22 at 20:48
  • I have 2 data series, each of which may have some NaN values that I wish to preserve. I want to combine the two series into a dataframe by index. The indices generally match, but there could be a few mismatches. When combining, I want all indices to be present. If only one of the series has a given index, I want the value for the other series to be 0 for that index. However, I want the NaNs that were originally in the series to stay as NaNs. Now I realise I could just do some if statements and artificially ensure both series have exactly the same indices before merging. – Aleksejs Fomins Oct 05 '22 at 07:21
  • I was naively expecting that there would be a parameter like "fill_missing" in pd.merge, but I can work around it, no problem. – Aleksejs Fomins Oct 05 '22 at 07:22 (one possible approach is sketched below)
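
For the scenario in the comments above (fill only the gaps introduced by combining, while keeping the NaNs that were already in the data), one possible sketch uses merge's indicator flag; the series names and values below are made up for illustration:

import pandas as pd
import numpy as np

# two series with their own NaNs that should be preserved
s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'], name='s1')
s2 = pd.Series([np.nan, 20.0], index=['b', 'd'], name='s2')

# outer-join on the index and record which side each row came from
combined = pd.merge(s1.to_frame(), s2.to_frame(),
                    left_index=True, right_index=True,
                    how='outer', indicator=True)

# fill only the gaps created by the join, not the pre-existing NaNs
combined.loc[combined['_merge'] == 'left_only', 's2'] = 0
combined.loc[combined['_merge'] == 'right_only', 's1'] = 0
combined = combined.drop(columns='_merge')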

I think you want to perform an outer merge:

In [60]:

pd.merge(names, info, how='outer')
Out[60]:
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief

There is a section in the docs showing the types of merges you can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

EdChum
  • Thanks for that, but I'm still having some trouble getting what I want. I have updated my question with more context. – Kevin Thompson Jan 27 '15 at 18:14
  • Why do we get NaN here (the float type Not a *Number*) instead of None? It seems quite odd that a float type comes into a string-only operation. – Martin Thøgersen Apr 27 '20 at 10:12
  • @MartinThøgersen because the type gets promoted due to the missing values. There is some experimental support for `NA` for scalar values: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na but not for strings; once you have mixed dtypes, operations become problematic because they become ambiguous – EdChum Apr 27 '20 at 10:20
  • How would you convert the NaNs into None in the `classification` column? – Martin Thøgersen Apr 27 '20 at 10:32
  • @MartinThøgersen I would expect `df.fillna(None, inplace=True)` to work – EdChum Apr 27 '20 at 10:42
  • @EdChum unfortunately fillna(value=None) is treated as a missing input parameter :-( Hence the method fails. – Martin Thøgersen Apr 27 '20 at 10:45
  • @MartinThøgersen Sorry, I suggested something that I thought would work without trying it; this is a pandas quirk. I looked at a git issue: https://github.com/pandas-dev/pandas/issues/17494 and this works: `what['classification'].replace({np.NaN:None})`. It seems that `None` just gets interpreted as `NaN`, but if you pass a dict then it correctly achieves the desired result. Additionally, using `where` also works: `what['classification'].where(pd.notnull(what['classification']), None)`. Related: https://stackoverflow.com/questions/14162723/replacing-pandas-or-numpy-nan-with-a-none-to-use-with-mysqldb – EdChum Apr 27 '20 at 11:00
  • Fantastic! The `what['classification'] = what['classification'].replace({np.NaN:None})` works. You saved my day. – Martin Thøgersen Apr 27 '20 at 11:08

Think of it as an SQL join operation. You need a left-outer join[1].

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})

info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})

Since there are names for which there is no classification, a left-outer join will do the job.

a = pd.merge(names, info, how='left', on='names')

The result is ...

>>> a
     names position classification
0      bob      dev            NaN
1    frank      dev          thief
2    james      dev            NaN
3      tim      sys           good
4  ricardo      sys            NaN
5     mike      sys            NaN
6     mark      sup          thief
7     joan      sup            NaN
8      joe      sup          thief

... which is fine. All the NaN results make sense if you look at both tables.
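
If you want the literal "unknown" labels from the question, a fillna on the result should do it (same idea as in the answers above):

a['classification'] = a['classification'].fillna('unknown')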

Cheers!

[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

Lucas Aimaretto

The join function can also be used for outer or inner joins. In the case above, let's suppose that names is the main table (all rows from this table must occur in the result). Then, to run a left outer join, use:

what = names.set_index('names').join(info.set_index('names'), how='left')

or, with the missing values filled in:

what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")

The set_index calls are used to create a temporary index column (the same one in both tables). If the dataframes already contained such an index, this step wouldn't be necessary. For example:

# define the index when creating the dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')

what = names.join(info, how='left')

To perform other types of join, just change the how argument (left/right/inner/outer are allowed). More info: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
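
Putting this together with the data from the question, a small self-contained sketch of the join approach:

import pandas as pd

names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
                      'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']}).set_index('names')
info = pd.DataFrame({'names':['joe','mark','tim','frank'],
                     'classification':['thief','thief','good','thief']}).set_index('names')

what = names.join(info, how='left').fillna('unknown')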

Lukas