1

I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?

For example,

      1     2   3     4     5   6   7  8  9  10  11    12    13  14    15    16
83  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0 16.0  17.0 NaN  28.0  30.0 NaN  28.0  18.0

The goal is for the data to look like this:

      1     2   3     4     5   6   7  ...    10  11    12    13  14    15    16
83  NaN  NaN NaN  27.0  29.0 29.0 30.0  ...  15.0 16.0  17.0  28.0 30.0  28.0  18.0

The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
ldoddy
  • 21
  • 3
  • 3
    You have five nans in the first "row" and three in the second, which makes it a bit puzzling. – Igor Rivin May 01 '20 at 03:40
  • That's unfortunately the best way I could type it. The dataframe is quite large, so it only shows an abridged version of each row. – ldoddy May 01 '20 at 05:06
  • `If there are not >= 5 data-filled cells, then take the average of however many cells there are. ` Can you add some data to question with this? also what is `8,9` column with expected output? It seems data are gone from this columns. – jezrael May 01 '20 at 05:12

2 Answers2

1

Assuming you need to move all NaN to the first columns I would define a function that takes all NaN and places them first and leave the rest as it is:

def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row

df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)

If you need to overwrite the results in the same dataframe then:

df.loc[:,df.columns[1:]] = df_fix.copy()
jcaliz
  • 3,891
  • 2
  • 9
  • 13
  • This is a huge step in the right direction! Thank you. There is a name column that is at column[0], so how would I work around that, as the names are integral to the relational aspect of the dataframe. – ldoddy May 01 '20 at 05:05
  • just call the `fun` of the answer on row[1:], and prepend the first element to the result. – Igor Rivin May 01 '20 at 05:10
  • Thanks @IgorRivin, you can also apply the function to all columns but the first one (column[0]), check the edit. – jcaliz May 01 '20 at 05:20
  • @jcaliz - hmmm, I think better is not using loops in pandas, because slow, check timings in my answer. – jezrael May 01 '20 at 05:31
  • 1
    Yeah @jezrael I agree that loops are not good for pandas, this was just an easy quick and simple to explain solution. – jcaliz May 01 '20 at 05:34
1

Use function justify for improve performance with filter all columns without first by DataFrame.iloc:

print (df)
   name     1     2   3     4     5   6   7     8     9    10  11    12    13  \
80  bob  27.0  29.0 NaN  29.0  30.0 NaN NaN  15.0  16.0  17.0 NaN  28.0  30.0   

    14    15    16  
80 NaN  28.0  18.0  


df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
print (df)
   name   1   2   3   4   5     6     7     8     9    10    11    12    13  \
80  bob NaN NaN NaN NaN NaN  27.0  29.0  29.0  30.0  15.0  16.0  17.0  28.0   

      14    15    16  
80  30.0  28.0  18.0  

Function:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

Performance:

#100 rows
df = pd.concat([df] * 100, ignore_index=True)

#41 times slowier
In [39]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)

#198 times slowier
In [43]: %timeit df.loc[:,df.columns[1:]] =  df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan,  side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252