2

I have a pandas dataframe where half of the rows (where buffer_id = 'off') have missing values (NaNs and -9999) in three of the columns. I'm trying to fill those missing values using the corresponding values from the other half (where buffer_id = 'on') based on a matching site ID.

Here is a reproducible example:

df = pd.DataFrame({'site_id': ['A', 'B', 'C', 'A', 'B', 'C'],
               'buffer_id': ['on', 'on', 'on', 'off', 'off', 'off'],
               'easting': [111, 222, 333, 'NaN', 'NaN', 'NaN'],
               'northing': [444, 555, 666, 'NaN', 'NaN', 'NaN'],
               'year': [1990, 1995, 2000, -9999, -9999, -9999],
               'ndvi': [12, 22, 32, 42, 52, 62]})

So the missing values in 'easting', 'northing', and 'year' should be filled with the values from the rows with corresponding site ID.

How would you go about doing that?

Thank you.

vancaron
  • 21
  • 3
  • Does this answer your question? [Python Pandas fill dataframe with missing values](https://stackoverflow.com/questions/44711471/python-pandas-fill-dataframe-with-missing-values) – Dhana D. Aug 31 '21 at 19:08
  • Does this answer your question? https://stackoverflow.com/questions/30357276/how-to-pass-another-entire-column-as-argument-to-pandas-fillna – ccommjin Aug 31 '21 at 19:52

2 Answers2

1

Using pandas, I would suggest building a table of unique attributes for each site, and left joining it back to your orginal dataframe, as follows:

# Import packages
import pandas as pd
import numpy as np

# Your data frame
df = pd.DataFrame({'site_id': ['A', 'B', 'C', 'A', 'B', 'C'],
               'buffer_id': ['on', 'on', 'on', 'off', 'off', 'off'],
               'easting': [111, 222, 333, 'NaN', 'NaN', 'NaN'],
               'northing': [444, 555, 666, 'NaN', 'NaN', 'NaN'],
               'year': [1990, 1995, 2000, -9999, -9999, -9999],
               'ndvi': [12, 22, 32, 42, 52, 62]})

# Clean NAs in year column
df['year'] = df['year'].replace(-9999, np.nan)

# Make table of unique attributes for each site (easting, northing, and year)
site_attributes = df[['site_id', 'easting', 'northing', 'year']].dropna().drop_duplicates()

# Join unique attributes to your original table
new_df = df[['site_id', 'buffer_id', 'ndvi']].merge(site_attributes, 'left')
Xavier GB
  • 97
  • 1
  • 7
0

This should do the trick, vantastic!

    # create Python dictionary holding key value pairs
    # key is the well_id, value is a List holding the dataframe index of the identical well IDs
    g = df.groupby('well_id').groups
        
    #iterate through the lists in the dictionary
    for val in g.values():
        # assign the value of the easting, northing, and year columns from the first index of the pair to the second index
        df.loc[val[1], 'easting'] = df.loc[val[0], 'easting']
        df.loc[val[1], 'northing'] = df.loc[val[0], 'northing']
        df.loc[val[1], 'year'] = df.loc[val[0], 'year']
conormacn
  • 1
  • 1
  • If `NaN`s are the first value in the dictionary, then this could potentially overwrite any subsequent values with `NaN`. To use this approach safely, the dataframe would have to be sorted. – Xavier GB Aug 31 '21 at 22:08
  • Please provide additional details in your answer. As it's currently written, it's hard to understand your solution. – Community Sep 01 '21 at 00:05