2

I would like to efficiently merge two data frames into one, but one data frame has "more data" than the other. Example:

df_A = pd.DataFrame({"Time": [pd.to_datetime("09:11:37.600"),
                              pd.to_datetime("09:11:37.700"),
                              pd.to_datetime("09:11:37.800")],
                    "A": [0.1, 0.7, -1.1]})
df_B = pd.DataFrame({"Time": [pd.to_datetime("09:11:37.610"),
                              pd.to_datetime("09:11:37.640"),
                              pd.to_datetime("09:11:37.670"),
                              pd.to_datetime("09:11:37.700"),
                              pd.to_datetime("09:11:37.730"),
                              pd.to_datetime("09:11:37.760"),
                              pd.to_datetime("09:11:37.790"),
                              pd.to_datetime("09:11:37.820")],
                    "B": [0.3, -1.5, -0.5, 0.2, 1.2, -0.9, 0.1, -0.2]})

I would like to create a third data frame, df_C, that is an augmented copy of the shortest previous data frame. The augmentation is given by the "smoothed" columns from the longest data frame. The smoothing can be performed by an averaging operation, or another one. In this example, I would like to do the following:

  • For the row corresponding to pd.to_datetime("09:11:37.600"), average out the values 0.3, -1.5, -0.5 because they correspond to the times between pd.to_datetime("09:11:37.600") and the time in the next row, pd.to_datetime("09:11:37.700");
  • For the corresponding to pd.to_datetime("09:11:37.700"), average out the values 0.2, 1.2, -0.9, 0.1 because they correspond to the times between pd.to_datetime("09:11:37.700") and the time in the next row, pd.to_datetime("09:11:37.800");
  • And so on.

The data frame df_C would then have three columns: Time, A, and B, where Time and A come from df_A, and B is the "smoothed" column from df_B following the procedure above.

Is there a way to do this without explicitly writing a for loop, which can be expensive for very long data frames?

I tried the following, but it's copying the same value to all dates (i.e., it's not correct).

df_C = df_A.copy()
df_C.loc[:, "B"] = df_B.loc[(df_B["Time"] >= df_A.shift(1)["Time"].values[1]) & (df_B["Time"] < df_A.shift(-1)["Time"].values[0]), "B"].mean()
Bruno
  • 1,329
  • 2
  • 15
  • 35

1 Answers1

1

You can use the pandas.resample() method if you use your times as an index.

Here you can find the abbreviations used for the different periods.

Code

import pandas as pd

df_B = pd.DataFrame(
    {"B": [0.3, -1.5, -0.5, 0.2, 1.2, -0.9, 0.1, -0.2]},
    index = [
        pd.to_datetime("09:11:37.610"),
        pd.to_datetime("09:11:37.640"),
        pd.to_datetime("09:11:37.670"),
        pd.to_datetime("09:11:37.700"),
        pd.to_datetime("09:11:37.730"),
        pd.to_datetime("09:11:37.760"),
        pd.to_datetime("09:11:37.790"),
        pd.to_datetime("09:11:37.820")])

df_resampled = df_B.resample('100L').mean()

Result

                                B
2018-03-05 09:11:37.600 -0.566667
2018-03-05 09:11:37.700  0.150000
2018-03-05 09:11:37.800 -0.200000
Roald
  • 2,459
  • 16
  • 43
  • Thanks! So by changing the definition of `df_A` to match `df_B`'s (i.e., with the timestamp as the index) and then doing `df_C = df_A.merge(df_resampled, how="outer", left_index=True, right_index=True)`, I get the desired outcome. – Bruno Mar 06 '18 at 12:42