I want to fill in the NA values in 'Col' based on the ID number. I have tried groupby.

df=pd.DataFrame({
    'ID':[1,2,1,2,1,2],
    'Col':['One','NaN','NaN','Two','NaN','NaN']
})

This is the expected output:

df=pd.DataFrame({
    'ID':[1,2,1,2,1,2],
    'Col':['One','Two','One','Two','One','Two']
})

I know this is an easy example, but I would appreciate any help you could give me. Also, I have a dataframe with 1 million rows, so anything time-efficient would be appreciated.

What I have tried:

x=df_total[df_total['id'].astype(str)=='2']
buck_map = dict(x[~x['buckets'].isnull()][['id','buckets']].values)
x['buckets']=x['id'].map(buck_map)
Shawn Jamal
  • Does this answer your question? [Fill Na in multiple columns with values from another column within the pandas data frame](https://stackoverflow.com/questions/57303445/fill-na-in-multiple-columns-with-values-from-another-column-within-the-pandas-da) – Franciska Mar 03 '23 at 17:57
  • Note: you do not have NaN values in your sample; you have strings of `'NaN'` – user19077881 Mar 03 '23 at 18:05
  • Unfortunately it does not help – Shawn Jamal Mar 03 '23 at 18:06
  • Please clean up your example (no strings `'NaN'`) and also describe what you want to happen when there are conflicting values of `Col` for a given `ID`. If you believe there are never, ever such conflicting values, I'd advise you to be defensive and `assert` so. Data is _most of the time_ messy. – Pierre D Mar 04 '23 at 18:59

3 Answers

It is not clear what you really want: whether this is just a lookup and substitution, or whether groupby is needed. Assuming the column holds strings (including the string 'NaN') and that you just want a substitution, you need a way of translating each ID, such as 1 to 'One' (a dictionary is ideal), and then a way of applying it to each row. You can use:

import pandas as pd

df=pd.DataFrame({
    'ID':[1,2,1,2,1,2],
    'Col':['One','NaN','NaN','Two','NaN','NaN']
})

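def func(row):
    # Hard-coded lookup from ID to the value that should fill 'Col'
    d = {0: 'zero', 1: 'One', 2: 'Two'}
    # The sample holds the string 'NaN', not a real missing value
    if row['Col'] == 'NaN':
        val = d[row['ID']]
    else:
        val = row['Col']
    return val

# Apply the function row by row
df['Col'] = df.apply(func, axis=1)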

print(df)

which gives:

   ID  Col
0   1  One
1   2  Two
2   1  One
3   2  Two
4   1  One
5   2  Two
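
Since the question mentions 1 million rows, a row-wise `apply` may be slow. A minimal vectorized sketch of the same substitution follows, assuming the same hard-coded dictionary `d` and the string placeholder `'NaN'` as above; the names `filled` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 2, 1, 2],
    'Col': ['One', 'NaN', 'NaN', 'Two', 'NaN', 'NaN'],
})

# Same hard-coded lookup as in func above (an assumption about the data).
d = {0: 'zero', 1: 'One', 2: 'Two'}

# Keep 'Col' where it is not the placeholder string 'NaN';
# elsewhere take the value looked up from 'ID'.
filled = df['Col'].where(df['Col'] != 'NaN', df['ID'].map(d))
result = df.assign(Col=filled)

print(result)
```

Unlike `func`, an ID missing from `d` would not raise a `KeyError` here; `map` would leave a missing value in that position instead.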
user19077881

Your question is ambiguous, as there are several ways to produce the desired output based on your example.

Assuming that you are looking for the "majority value" per ID, and that the NaNs are actual float('nan') values to be ignored rather than the string 'NaN', the following would be quite efficient:

def majority(s):
    # mode() ignores NaN by default; [0] picks the first of the most frequent values
    return s.mode()[0]

newdf = df.assign(Col=df.groupby('ID')['Col'].transform(majority))

>>> newdf
   ID  Col
0   1  One
1   2  Two
2   1  One
3   2  Two
4   1  One
5   2  Two

Note: to make sure the 'NaN' entries are real NaN values and not strings, do this first:

df = df.assign(Col=df['Col'].replace({'NaN': float('Nan')}))
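
If you expect exactly one value per ID, it may be worth checking that before filling. The following is only a sketch of such a defensive check, assuming real NaN values; `conflicting_ids` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def conflicting_ids(frame):
    # Hypothetical helper: IDs that have more than one distinct
    # non-missing 'Col' value (nunique skips NaN by default).
    distinct = frame.groupby('ID')['Col'].nunique()
    return distinct[distinct > 1].index.tolist()

df = pd.DataFrame({
    'ID': [1, 2, 1, 2, 1, 2],
    'Col': ['One', np.nan, np.nan, 'Two', np.nan, np.nan],
})

# Fail early if any ID maps to several different values.
bad = conflicting_ids(df)
assert not bad, f"conflicting Col values for IDs: {bad}"
```

With messy data, for example ID 1 carrying both 'One' and 'Uno', the helper would return `[1]` and the assertion would fail before any filling happens.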
Pierre D

You can create a dictionary mapping each ID to its fill value (here, the last non-null 'Col' value in each group):

fill_dict = df.groupby('ID')['Col'].last().to_dict()

Then replace the NaN values with the fill values using the dictionary:

df['Col'] = df['Col'].fillna(df['ID'].map(fill_dict))
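
Put together on the sample data, the two steps can be sketched as below. This assumes the missing entries are real NaN values (`np.nan`) rather than the string 'NaN' used in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with real missing values (an assumption).
df = pd.DataFrame({
    'ID': [1, 2, 1, 2, 1, 2],
    'Col': ['One', np.nan, np.nan, 'Two', np.nan, np.nan],
})

# last() skips NaN, so each ID maps to its last non-null 'Col' value.
fill_dict = df.groupby('ID')['Col'].last().to_dict()

# Fill only the missing entries, looking each one up by its ID.
df['Col'] = df['Col'].fillna(df['ID'].map(fill_dict))

print(df)
```

Both steps are vectorized, so no Python-level loop over the rows is involved. An ID whose 'Col' values are all missing would stay missing.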
godot