0

I want to split a dataset with ~ 1 million rows based on a multiple occurring ID and a mode type (active, not active). When splitting, missing values in numeric columns should be interpolated and categorical values should be filled with ffill. Finally, the remaining zero values are to be dropped. For this I have written the following function:

objectList = list(df_sorted.select_dtypes(include=["O", "datetime64[ns]"]).columns)
floatList = list(df_sorted.select_dtypes(include=["float64"]).columns)

def fill_missing_values(df_group):
    df_group[objectList] = df_group[objectList].ffill()
    df_group[floatList] = df_group[floatList].interpolate(
        method="linear", limit_direction="forward"
    )
    df_group.dropna()
    return df_group

The function is now to be applied as follows:

df_nn = df_sorted.groupby(["ID", "Mode"]).apply(
    lambda df_sorted: fill_missing_values(df_sorted)
)

The cell is executed without errors, but the output takes way too long. So my question is: Is this approach overall correct or am I missing something? And how can this code get more performance?

Input Data

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
        "MODE": [
            "active",
            "active",
            "active",
            "inactive",
            "inactive",
            "active",
            "active",
            "active",
            "inactive",
        ],
        "Signal1  ": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
        "Signal2  ": [np.nan, 0.1, 0.3, "NaN", 4.5, "NaN", 2.0, 3.0, np.nan],
        "Signal3  ": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
    }
)

df

    ID  MODE     Signal1  Signal2  Signal3
0   0A  active   13       NaN      on
1   0A  active   NaN      0.1      NaN
2   0A  active   4        0.3      NaN
3   0A  inactive 11       NaN      off
4   0A  inactive NaN      4.5      NaN
5   1C  active   22       NaN      on
6   1C  active   25       2.0      NaN
7   1C  active   NaN      3.0      on
8   1C  inactive 19       NaN      NaN

Desired Output after ffill and interpolation of ID "0A":

    ID  MODE      Signal1     Signal2   Signal3
0   0A  active    13.0        NaN       on
1   0A  active    8.5         0.1       on
2   0A  active    4.0         0.3       on
3   0A  inactive  11.0        NaN       off
4   0A  inactive  11.0        4.5       off

Desired Output after dropna of ID "0A":

    ID  MODE    Signal1  Signal2    Signal3
0   0A  active  8.5      0.1        on
1   0A  active  4.0      0.3        on
    ID  MODE      Signal1    Signal2    Signal3
0   0A  inactive  11         4.5        off
Horseman
  • 297
  • 1
  • 14
  • 1
    Please include a [minimum reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) - a sample of your input DataFrame and your expected output. – not_speshal Nov 22 '21 at 13:56
  • @not_speshal thanks for the tip. I added the dataframe and the expected output – Horseman Nov 22 '21 at 14:12
  • @not_speshal I simplified the example for better understanding – Horseman Nov 22 '21 at 15:07

2 Answers2

1

IIUC, you want:

  1. groupby the ID and MODE columns and interpolate all numeric columns
  2. groupby the ID and MODE columns and ffill all non-numeric columns
import numpy as np

#replace string "NaN" with numpy.nan
df = df.replace("NaN", np.nan)

numeric = df.filter(like="Signal").select_dtypes(np.number).columns
others = df.filter(like="Signal").select_dtypes(None,np.number).columns

df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(pd.Series.interpolate, limit_direction="forward")
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")

>>> df
   ID      MODE  Signal1  Signal2 Signal3
0  0A    active     13.0      NaN      on
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
3  0A  inactive     11.0      NaN     off
4  0A  inactive     11.0      4.5     off
5  1C    active     22.0      NaN      on
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
8  1C  inactive     19.0      NaN     NaN

>>> df.dropna()
   ID      MODE  Signal1  Signal2 Signal3
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
4  0A  inactive     11.0      4.5     off
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • The problem is, that the numeric columns depending on the mode. I cant interpolate the values of two different modes. – Horseman Nov 22 '21 at 15:06
  • @Horseman - Your earlier example, you interpolated 13 and 4 to get 8.5 where 13 was "active" and 4 was "inactive". I edited my answer to match your latest sample. – not_speshal Nov 22 '21 at 15:13
  • I reviewed my Question and made a more simple example for better understanding. Please forget the first example and stick to the actual one. – Horseman Nov 22 '21 at 15:18
0

First fill Signal1 with mean by :

df['Signal1']=df.groupby(['ID','MODE'])['Signal1'].apply(lambda x:x.fillna(x.mean()))

Next groupby to get Signal3 and merge it

signal3 = df[['ID','MODE','Signal3']].dropna().drop_duplicates()
signal3 = signal3.rename(columns={'Signal3':'Signal3_new'})
df2 = pd.merge(df,signal3, how='left', on=['ID','MODE'])

fillna in Signal3 with Signal3_new or use Signal3_new

Fariliana Eri
  • 181
  • 2
  • 5