1

I have a DataFrame like this:

import numpy as np

raw_data = {'surface': [np.nan, np.nan, 'round', 'square'],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami']}

df = pd.DataFrame(raw_data, columns = ['surface', 'city'])

This looks like this:

        surface city
   0    NaN     San Francisco
   1    NaN     Miami
   2    round   San Francisco
   3    square  Miami

I need earliest instance of the San Francisco row to be filled with 'round', and the earlier Miami row to be filled with 'square'. Using .fillna(method='bfill') won't take into account other column values, and just fills all earlier rows with round.

The result would be:

        surface city
   0    round   San Francisco
   1    square  Miami
   2    round   San Francisco
   3    square  Miami
minimumnz
  • 35
  • 1
  • 6

2 Answers2

1

You can use groupby.bfill; group data frame by city column and then use bfill:

df.groupby('city').bfill()

#  surface           city
#0  round   San Francisco
#1  square          Miami
#2  round   San Francisco
#3  square          Miami
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • This solution will do a *back fill* on the data frame for each city, if we assume pandas groupby is [stable](http://stackoverflow.com/questions/39373820/is-pandas-dataframe-groupby-guaranteed-to-be-stable), which probably is. – Psidom May 08 '17 at 03:15
1

[Modified based on the admirable answer from PSidom]

Using groupby() is the key point indeed, but it might be confusing not to mention what bfill() does as it's not doing what you actually think it does.

Let's take a quick glance at the doc here. Instead of back filling the data like what the OP wants, it actually just fill in the missing data with non-missing data in the next column. It works great with groupby() in this case, while you also need to do groupby('*your group*').ffill() for forward filling in case that the data you have are more complicated.

For further illustration, let's modify your data like this:

import numpy as np
import pandas as pd

raw_data = {'surface': [np.nan, np.nan, 'round', 'square', np.nan, np.nan, np.nan, np.nan],
            'city': ['San Francisco', 'Miami', 'San Francisco', 'Miami', 'Miami', 'Miami', 'San Francisco', 'Miami']}
df = pd.DataFrame(raw_data, columns = ['surface', 'city'])
df

#   surface city
#0  NaN     San Francisco
#1  NaN     Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

With only df.groupby('city').bfill(), you'll got:

df2 = df.groupby('city').bfill()
df2

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  NaN     Miami
#5  NaN     Miami
#6  NaN     San Francisco
#7  NaN     Miami

See what is going on there? bfill() did the job in row 0 and 1, but remain row 4 ~ 7 unchanged. You should use both bfill() and ffill() instead. Maybe something like this:

df3 =  df2.groupby('city').ffill()
df3

#   surface city
#0  round   San Francisco
#1  square  Miami
#2  round   San Francisco
#3  square  Miami
#4  square  Miami
#5  square  Miami
#6  round   San Francisco
#7  square  Miami

To be noticed, you shouldn't use something like df.groupby('city').bfill().ffill(). It'll fill in something wrong there.

Community
  • 1
  • 1
CDtoday
  • 59
  • 5