Backfill values in Pandas series when value matches another column

Question

I have a DataFrame like this:

import numpy as np

raw_data = {'surface': [np.nan, np.nan, 'round', 'square'],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami']}

df = pd.DataFrame(raw_data, columns = ['surface', 'city'])

This looks like this:

        surface city
   0    NaN     San Francisco
   1    NaN     Miami
   2    round   San Francisco
   3    square  Miami

I need earliest instance of the San Francisco row to be filled with 'round', and the earlier Miami row to be filled with 'square'. Using .fillna(method='bfill') won't take into account other column values, and just fills all earlier rows with round.

The result would be:

        surface city
   0    round   San Francisco
   1    square  Miami
   2    round   San Francisco
   3    square  Miami

Do you want to fill NaN with the first real value from the same city? — Allen Qin, May 08 '17 at 03:11

score 1 · Accepted Answer · answered May 08 '17 at 03:07

1

You can use groupby.bfill; group data frame by city column and then use bfill:

df.groupby('city').bfill()

#  surface           city
#0  round   San Francisco
#1  square          Miami
#2  round   San Francisco
#3  square          Miami

answered May 08 '17 at 03:07

Psidom

209,562
33
339
356

This solution will do a *back fill* on the data frame for each city, if we assume pandas groupby is [stable](http://stackoverflow.com/questions/39373820/is-pandas-dataframe-groupby-guaranteed-to-be-stable), which probably is. – Psidom May 08 '17 at 03:15

score 1 · Answer 2 · edited May 23 '17 at 12:10

[Modified based on the admirable answer from PSidom]

Using groupby() is the key point indeed, but it might be confusing not to mention what bfill() does as it's not doing what you actually think it does.

Let's take a quick glance at the doc here. Instead of back filling the data like what the OP wants, it actually just fill in the missing data with non-missing data in the next column. It works great with groupby() in this case, while you also need to do groupby('*your group*').ffill() for forward filling in case that the data you have are more complicated.

For further illustration, let's modify your data like this:

import numpy as np
import pandas as pd

raw_data = {'surface': [np.nan, np.nan, 'round', 'square', np.nan, np.nan, np.nan, np.nan],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami', 'Miami', 'Miami', 'San Francisco', 'Miami']}
df = pd.DataFrame(raw_data, columns = ['surface', 'city'])
df

#   surface city
#0  NaN     San Francisco
#1  NaN     Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

With only df.groupby('city').bfill(), you'll got:

df2 = df.groupby('city').bfill()
df2

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

See what is going on there? bfill() did the job in row 0 and 1, but remain row 4 ~ 7 unchanged. You should use both bfill() and ffill() instead. Maybe something like this:

df3 =  df2.groupby('city').ffill()
df3

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  square  Miami
#5  square  Miami
#6  round   San Francisco
#7  square  Miami

To be noticed, you shouldn't use something like df.groupby('city').bfill().ffill(). It'll fill in something wrong there.

Backfill values in Pandas series when value matches another column

2 Answers2

Linked

Related