I have a wide data frame with several years:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
                  data = np.array([[np.nan, np.nan, 16, 17, 18, 19, np.nan],
                                   [16, 17, 18, 19, 20, 21, np.nan],
                                   [np.nan, np.nan, np.nan, np.nan, 16, 17, 31],
                                   [np.nan, 22, 23, 24, np.nan, 26, np.nan],
                                   [36, 36, 37, 38, 39, 40, 55]]))

Year     1991  1992  1993  1994  1995  1996  2010-2012
ID                                                    
29925     NaN   NaN  16.0  17.0  18.0  19.0        NaN
223725   16.0  17.0  18.0  19.0  20.0  21.0        NaN
280165    NaN   NaN   NaN   NaN  16.0  17.0       31.0
813285    NaN  22.0  23.0  24.0   NaN  26.0        NaN
956765   36.0  36.0  37.0  38.0  39.0  40.0       55.0

Each row holds the age of one person, identified by a unique ID, in each year. I want to fill the NaNs in every row based on the ages that are already known for that row.

For example, ID 29925 is 16 in 1993, so we know they were 15 in 1992 and 14 in 1991, and I want to replace the NaNs for 29925 in the 1991 and 1992 columns with those values. Similarly, I want to fill the NaN in the 2010-2012 column from the existing ages; let's assume 29925 is 15 years older in the 2010-2012 column than in 1996. What is the fastest way to do this for the whole data frame, i.e. for all IDs?

ALollz
MI MA
  • last line first 36 is 35, yepp? There shouldn't be two 36s... – zabop Jul 24 '20 at 19:00
  • Yeah, can you definitely explain that last row? If there's a possibility that your data are messy such that age doesn't always increase by 1 year to year (or only by 14 from 1996- 2010-2012 instead of 15) how do you want to resolve such cases? In that case I could imagine you might have gaps like `35 NaN 36` and it becomes impossible to know whether that `NaN` should become 35 or 36... – ALollz Jul 24 '20 at 19:19
  • Hi, thank you both for the replies. We can argue it is unknown whether the last line should be 35 or 36. The data is from a hypothetical survey, and that person could have been 36 at the time they responded to the survey in both 1991 and 1992: for example, they responded in Dec-1991 and again in Jan-1992, and their birthday is in June. As for the cases with the other `NaN`s, they should be filled according to a general rule of +1 or -1 per year. – MI MA Jul 25 '20 at 11:42

1 Answer

# imports we need later
import numpy as np
import pandas as pd

This is not a particularly efficient method, but it works. I'll leave out your last column to keep things systematic.

The df:

df = pd.DataFrame(index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
                  columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996], name='Year'),
                  data = np.array([[np.nan, np.nan, 16, 17, 18, 19],
                                   [16, 17, 18, 19, 20, 21],
                                   [np.nan, np.nan, np.nan, np.nan, 16, 17],
                                   [np.nan, 22, 23, 24, np.nan, 26],
                                   [35, 36, 37, 38, 39, 40]]))

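Calculate the year of birth implied by every cell (column year minus age):

# column labels as integers, so they can be subtracted from
years = np.asarray(df.columns, dtype=int)
dob = []
for _, row in df.iterrows():
    dob.append(years - row.to_numpy())  # NaN ages give NaN birth years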

or, if you are into list comprehensions:

dob = [np.asarray(df.columns, dtype=int) - row.to_numpy() for _, row in df.iterrows()]

Now dob is like this:

[array([  nan,   nan, 1977., 1977., 1977., 1977.]),
 array([1975., 1975., 1975., 1975., 1975., 1975.]),
 array([  nan,   nan,   nan,   nan, 1979., 1979.]),
 array([  nan, 1970., 1970., 1970.,   nan, 1970.]),
 array([1956., 1956., 1956., 1956., 1956., 1956.])]

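Collapse each row to a single birth year: drop the NaNs, then take the first entry of np.unique (which is sorted, so this is the smallest value if a row has more than one candidate):

dob_filtered = [np.unique(each[~np.isnan(each)])[0] for each in dob]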

dob_filtered now looks like this:

[1977.0, 1975.0, 1979.0, 1970.0, 1956.0]
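
One caveat: np.unique(...)[0] silently picks the smallest candidate when a row implies more than one birth year, which is what happens with the question's original last row (36 in both 1991 and 1992). A minimal sketch of a consistency check, assuming you want to flag such IDs before filling anything. The names ages, birth, n_candidates and ambiguous are made up for this illustration, and the two rows are copied from the question's data:

```python
import numpy as np
import pandas as pd

# Two rows from the question; the second keeps the repeated 36 on purpose.
years = [1991, 1992, 1993, 1994, 1995, 1996]
ages = pd.DataFrame(
    [[np.nan, np.nan, 16, 17, 18, 19],
     [36, 36, 37, 38, 39, 40]],
    index=pd.Index([29925, 956765], name='ID'),
    columns=pd.Index(years, name='Year'),
)

# Birth year implied by each cell: column year minus age (NaN stays NaN).
birth = pd.DataFrame(np.asarray(years) - ages.to_numpy(),
                     index=ages.index, columns=ages.columns)

# Number of distinct implied birth years per ID (NaN is ignored by default).
n_candidates = birth.nunique(axis=1)

# IDs whose known ages do not agree on a single birth year.
ambiguous = n_candidates[n_candidates > 1].index.tolist()
print(ambiguous)
```

How to resolve a flagged ID (smallest, largest or most frequent birth year) is a judgment call that depends on how the survey dates are defined.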

Attach this list to the dataframe as a new column:

df['dob']=dob_filtered

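Fill in the NaNs of the df using the dob column (every age is recomputed as year minus birth year, so consistent existing values come out unchanged):

for row in df.index:
    # every year column; only the trailing 'dob' column is skipped
    for col in df.columns[:-1]:
        df.loc[row, col] = col - df.loc[row, 'dob']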

Delete the dob column (just to get back the original columns, otherwise not important). Note that drop returns a new dataframe, so the result has to be assigned:

df = df.drop(columns='dob')

Obtaining:

Year    1991    1992    1993    1994    1995    1996
ID                      
29925   14.0    15.0    16.0    17.0    18.0    19.0
223725  16.0    17.0    18.0    19.0    20.0    21.0
280165  12.0    13.0    14.0    15.0    16.0    17.0
813285  21.0    22.0    23.0    24.0    25.0    26.0
956765  35.0    36.0    37.0    38.0    39.0    40.0
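
The loops can probably be avoided altogether. Below is a rough sketch of a vectorized variant that also handles the 2010-2012 column. It rests on two assumptions that are mine, not something the data confirms: the range column is treated as the year 2011 (which matches the "15 years older than in 1996" rule from the question), and each ID's birth year is taken from its first known age. The names year_of, years, birth, birth_year, expected and filled are made up for this example, and the last row uses 35 for 1991, as in the df above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, 16, 17, 18, 19, np.nan],
     [16, 17, 18, 19, 20, 21, np.nan],
     [np.nan, np.nan, np.nan, np.nan, 16, 17, 31],
     [np.nan, 22, 23, 24, np.nan, 26, np.nan],
     [35, 36, 37, 38, 39, 40, 55]],
    index=pd.Index([29925, 223725, 280165, 813285, 956765], name='ID'),
    columns=pd.Index([1991, 1992, 1993, 1994, 1995, 1996, '2010-2012'], name='Year'),
)

# ASSUMPTION: the '2010-2012' range stands for 2011 (1996 + 15).
year_of = {col: 2011 if col == '2010-2012' else int(col) for col in df.columns}
years = np.array([year_of[col] for col in df.columns], dtype=float)

# Birth year implied by each cell; NaN ages give NaN birth years.
birth = pd.DataFrame(years - df.to_numpy(dtype=float),
                     index=df.index, columns=df.columns)

# ASSUMPTION: one birth year per ID, taken from the first known age in the row.
birth_year = birth.bfill(axis=1).iloc[:, 0]

# Age of every ID in every year, via broadcasting (rows: IDs, columns: years).
expected = pd.DataFrame(years[None, :] - birth_year.to_numpy()[:, None],
                        index=df.index, columns=df.columns)

# Keep the observed ages and fill only the gaps.
filled = df.where(df.notna(), expected)
print(filled)
```

Because only the NaNs are replaced, observed values that disagree with the assumed rule are left alone: ID 280165 keeps its 31 in the 2010-2012 column even though 17 + 15 would give 32.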


zabop
  • I had been thinking about this for so long, and honestly creating a DataFrame of the birth year really is the key to doing this efficiently. You can actually get around a lot of the iterations once you have `dob` and `ffill` and `bfill` it. – ALollz Jul 24 '20 at 19:06
  • I found the problem intriguing as well; if you can make it more efficient by getting around the iterations, I am interested! – zabop Jul 24 '20 at 19:12
  • (Feel free to edit my answer or add your own using parts of mine) – zabop Jul 24 '20 at 19:12
  • dob is the key... if df has many rows it might be faster to build the whole df from scratch: `df_new = pd.DataFrame([[col - dob for col in df.columns[:-2]] for dob in df['dob']], index=df.index, columns=df.columns[:-2])` ...then add 2010-2012 unchanged – RichieV Jul 24 '20 at 19:14