I have a Pandas DataFrame that I need to:
- group by the ID column (not in index)
- forward fill rows to the right with the previous value (multiple columns) only if it's not a NaN (
np.nan
)
For each ID categorical value and each metric column (see the aX
columns in the examples below) there is only value (the others when having multiple rows are NaN - np.nan
).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1
to a4
.
I would like to:
- pretend
np.nan
is zero0.0
when on the same column and different row (with same ID) there is a number so I can sum them together like withgroupby
and subsequent aggregation functions - forward fill to the right on the same unique row (by ID) only if somewhere on a previous column to the left there was a number
So basically in the example this means that:
- for ID
1
"a2"=100.0
- for ID
2
"a1"
and"a2"
are bothnp.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
In [7]:
The forward filling to the right should apply to multiple columns on the same row, not only for the closest row to the right.
When I use my_df.interpolate(method='pad', axis=1,limit=None,limit_direction='forward',limit_area=None,downcast=None,)
then I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum()
then I see 0.0
everywhere rather than retaining the NaN values in those scenarios defined above.
When I use my_df.groupby("id").apply(np.sum)
the ID columns is summed as well, so this is wrong as it should be retained.
How do I do this?