If I have a dataframe (df_data) like:

ID        Time                X        Y        Z        H
05  2020-06-26 14:13:16    0.055    0.047    0.039    0.062
05  2020-06-26 14:13:21    0.063    0.063    0.055    0.079
05  2020-06-26 14:13:26    0.063    0.063    0.063    0.079
05  2020-06-26 14:13:31    0.095    0.102    0.079    0.127
...    ..    ...     ...     ...      ...      ...      ...
01  2020-07-01 08:59:43    0.063    0.063    0.047    0.079
01  2020-07-01 08:59:48    0.055    0.055    0.055    0.079
01  2020-07-01 08:59:53    0.071    0.063    0.055    0.082
01  2020-07-01 08:59:58    0.063    0.063    0.047    0.082
01  2020-07-01 08:59:59    0.047    0.047    0.047    0.071

[17308709 rows x 8 columns]

which I want to filter by another dataframe of intervals (df_intervals), like:

int_id         start               end
1            2020-02-03 18:11:59   2020-02-03 18:42:00
2            2020-02-03 19:36:59   2020-02-03 20:06:59
3            2020-02-03 21:00:59   2020-02-03 21:31:00
4            2020-02-03 22:38:00   2020-02-03 23:08:00
5            2020-02-04 05:55:00   2020-02-04 06:24:59
...                         ...                   ...
1804         2021-01-10 13:50:00   2021-01-10 14:20:00
1805         2021-01-10 18:10:00   2021-01-10 18:40:00
1806         2021-01-10 19:40:00   2021-01-10 20:10:00
1807         2021-01-10 21:25:00   2021-01-10 21:55:00
1808         2021-01-10 22:53:00   2021-01-10 23:23:00

[1808 rows x 2 columns]

what is the most efficient way to do so? I have a large dataset and if I try to iterate over it like:

for i in range(len(df_intervals)):
    df_filtered = df_data[df_data['Time'].between(df_intervals['start'][i], df_intervals['end'][i])]
    ...
    ...
    ...

it takes forever! I know I shouldn't iterate over large dataframes, but I have no idea how else I could filter it by every interval in the second dataframe.

The steps I'm trying to do are:

1- Get all the intervals (start/end columns) from df_intervals;

2- Use those intervals to create a new dataframe (df_stats) containing the statistics of the columns within those time ranges. Example:

      start                  end             ID    X_max    X_min    X_mean    Y_max    Y_min    Y_mean    ....
2020-02-03 18:11:59   2020-02-03 18:42:00    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    03    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    04    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    05    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    03    ...    ...    ...     ...   ...    ...    ...     ...

2 Answers

If Joe's answer does not give you the speed you desire, I think it can be improved by taking the stats calculation out of your for loop. (I'm stealing his df creation because he is a hero for putting that in his answer.) Ideally you could get rid of the for loop too, but with the timestamp indices being duplicated across the ID numbers it could be tricky to merge these two dataframes.

Here is my attempt, still using iteration to handle the start/end times. First, I apply the int_id to the parent df; adding it to the parent dataframe lets me groupby directly instead of building a 'temp' dataframe and computing my statistics on that.

for index, row in df2.iterrows():
    df1.loc[df1.Time.between(row.start, row.end), 'int_id'] = row.int_id

Applied to the sample data, df1 then looks like:

    ID                Time      X      Y      Z      H  int_id
0   01 2020-02-03 18:13:16  0.011  0.012  0.013  0.014     1.0
1   01 2020-02-03 18:13:21  0.015  0.016  0.017  0.018     1.0
2   01 2020-02-03 18:13:26  0.013  0.013  0.013  0.013     1.0
3   01 2020-02-03 18:13:31  0.015  0.015  0.015  0.015     1.0
4   02 2020-02-03 18:13:16  0.021  0.022  0.023  0.024     1.0
5   02 2020-02-03 18:13:21  0.025  0.026  0.027  0.028     1.0
6   02 2020-02-03 18:13:26  0.023  0.023  0.023  0.023     1.0

Then I define the aggregations to get it all in one go AFTER the loop is finished.

aggs = {'X':['sum', 'max', 'mean', 'median'], 
        'Y':['sum', 'max', 'mean', 'median'], 
        'Z':['sum', 'max', 'mean', 'median'], 
        'H':['sum', 'max', 'mean', 'median']}

df_final = df1.groupby('int_id').agg(aggs)

            X                            Y                             Z                            H                        
          sum    max    mean median    sum    max     mean median    sum    max    mean median    sum    max     mean  median
int_id                                                                                                                       
1.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345
2.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345

Note: here you end up with a MultiIndex on the columns. You can join the levels into flat names with the following.

df_final.columns = ['_'.join(col).strip() for col in df_final.columns.values]

        X_sum  X_max  X_mean  X_median  Y_sum  Y_max   Y_mean  Y_median  Z_sum  Z_max  Z_mean  Z_median  H_sum  H_max   H_mean  H_median
int_id                                                                                                                                  
1.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
2.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
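
If the iterrows() tagging loop itself turns out to be the bottleneck, the labelling step can also be vectorized with a pd.IntervalIndex. This is a minimal sketch, assuming the intervals never overlap (get_indexer raises on overlapping intervals) and that Time/start/end are already datetime64:

import numpy as np
import pandas as pd

# Interval index over the start/end columns; closed='both' mirrors
# Series.between, which is inclusive on both ends.
bins = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')

# For each timestamp, the position of the interval that contains it
# (-1 where no interval matches).
pos = bins.get_indexer(df1['Time'])

# Translate positions into int_id labels, NaN where nothing matched.
df1['int_id'] = np.where(pos >= 0, df2['int_id'].to_numpy()[pos], np.nan)

The groupby/agg step above then runs unchanged on the tagged frame.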
Correy Koshnick
    Thanks for the post. Upvoted. See the improvement. Agree with it – Joe Ferndz Feb 04 '21 at 17:24
  • thanks to both of you for taking the time to help me! I ran some tests on this situation: my df1 has 17174122 rows and my df2 has 1786 rows... iterating over them as you both suggested (iterrows) took 7min20sec! It's a little better than what I'm currently doing, but I wonder if there's any other way that'd be more efficient, or if that's the best I'll get since my datasets are big! thanks again for helping me – VictorSalles Feb 06 '21 at 20:01
  • It will be useful to determine which part is slowest before trying to speed things up. I would suggest splitting your code into four functions: the first loads your giant dataframe, the second does the iterrows() part to add groups to it, the third accepts the dataframe with the groups already added and computes the statistics, and the last saves the output. If you run cProfile against these functions you can see which is the slowest (see the sketch after these comments). https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script – Correy Koshnick Feb 08 '21 at 18:28
  • @CorreyKoshnick sorry, I wasn't clear earlier! I had checked the time elapsed only on the iterrows() part that adds groups to the big df (that single loop took 7+ min)... now I checked the second part, the groupby aggregations, and it took 2 seconds! So the slowest part is the iteration over the big df – VictorSalles Feb 10 '21 at 13:03
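
For reference, here is a minimal profiling sketch along the lines Correy suggests; the stage functions are hypothetical stubs to be replaced with the actual code:

import cProfile
import pstats

import pandas as pd

# Hypothetical stages -- substitute your real code for the stub bodies.
def load_data():
    # read the 17M-row dataframe and the intervals dataframe
    return pd.DataFrame({'Time': []}), pd.DataFrame({'int_id': [], 'start': [], 'end': []})

def tag_intervals(df1, df2):
    # the iterrows()/between() labelling step goes here
    return df1

def compute_stats(df1):
    # the groupby().agg() step goes here
    return df1

def main():
    df1, df2 = load_data()
    df1 = tag_intervals(df1, df2)
    return compute_stats(df1)

cProfile.run('main()', 'pipeline.prof')
pstats.Stats('pipeline.prof').sort_stats('cumulative').print_stats(10)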

Here's the full code to get this done. I created some sample data to verify that it works. Please run it against your full dataset and see if it gives you the results you need.

  1. Step 1: Create a temp list to store the interim dataframes.

    temp_list = []

  2. Step 2: Iterate through dataframe 2. For each row selected, do the following:

    • filter the rows from dataframe 1 for start and end date

      temp = df1[df1.Time.between(row.start,row.end)]

    • Groupby ID and get the statistical values for X, Y, Z, and H. One set for each column

      x = temp.groupby('ID')['X'].agg(['min', 'max', 'mean', 'median']).add_prefix('X_').reset_index()

    • Merge all the X, Y, Z, H items into a single dataframe.

    • Add Start and End date to the merged dataframe

    • Append the dataframe to the temp_list

  3. Step 3: Create a final dataframe with the temp_list

  4. Step 4: Rearrange the columns per your needs. Start & End date as first two columns, then ID, then X values, Y values, Z values, and finally H values.

  5. Step 5: Print the dataframe

Full Code to get this done:

c1 = ['ID','Time','X','Y','Z','H']
d1 = [
['01','2020-02-03 18:13:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 18:13:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 18:13:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 18:13:31',0.015,0.015,0.015,0.015],
     
['02','2020-02-03 18:13:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 18:13:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 18:13:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 18:13:31',0.025,0.025,0.025,0.025],
     
['03','2020-02-03 18:13:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 18:13:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 18:13:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 18:13:31',0.035,0.035,0.035,0.035],

['04','2020-02-03 18:13:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 18:13:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 18:13:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 18:13:31',0.045,0.045,0.045,0.045],
     
['05','2020-02-03 18:13:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 18:13:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 18:13:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 18:13:31',0.095,0.102,0.079,0.127],
     
['01','2020-02-03 20:03:16',0.011,0.012,0.013,0.014],
['01','2020-02-03 20:03:21',0.015,0.016,0.017,0.018],
['01','2020-02-03 20:03:26',0.013,0.013,0.013,0.013],
['01','2020-02-03 20:03:31',0.015,0.015,0.015,0.015],
     
['02','2020-02-03 20:03:16',0.021,0.022,0.023,0.024],
['02','2020-02-03 20:03:21',0.025,0.026,0.027,0.028],
['02','2020-02-03 20:03:26',0.023,0.023,0.023,0.023],
['02','2020-02-03 20:03:31',0.025,0.025,0.025,0.025],
     
['03','2020-02-03 20:03:16',0.031,0.032,0.033,0.034],
['03','2020-02-03 20:03:21',0.035,0.036,0.037,0.038],
['03','2020-02-03 20:03:26',0.033,0.033,0.033,0.033],
['03','2020-02-03 20:03:31',0.035,0.035,0.035,0.035],

['04','2020-02-03 20:03:16',0.041,0.042,0.043,0.044],
['04','2020-02-03 20:03:21',0.045,0.046,0.047,0.048],
['04','2020-02-03 20:03:26',0.043,0.043,0.043,0.043],
['04','2020-02-03 20:03:31',0.045,0.045,0.045,0.045],
     
['05','2020-02-03 20:03:16',0.055,0.047,0.039,0.062],
['05','2020-02-03 20:03:21',0.063,0.063,0.055,0.079],
['05','2020-02-03 20:03:26',0.063,0.063,0.063,0.079],
['05','2020-02-03 20:03:31',0.095,0.102,0.079,0.127],
     
['01','2020-07-01 08:59:43',0.063,0.063,0.047,0.079],
['01','2020-07-01 08:59:48',0.055,0.055,0.055,0.079],
['01','2020-07-01 08:59:53',0.071,0.063,0.055,0.082],
['01','2020-07-01 08:59:58',0.063,0.063,0.047,0.082],
['01','2020-07-01 08:59:59',0.047,0.047,0.047,0.071]]

import pandas as pd
df1 = pd.DataFrame(d1,columns=c1)
df1.Time = pd.to_datetime(df1.Time)

c2 = ['int_id','start','end']
d2 = [[1,'2020-02-03 18:11:59','2020-02-03 18:42:00'],
[2,'2020-02-03 19:36:59','2020-02-03 20:06:59'],
[3,'2020-02-03 21:00:59','2020-02-03 21:31:00'],
[4,'2020-02-03 22:38:00','2020-02-03 23:08:00'],
[5,'2020-02-04 05:55:00','2020-02-04 06:24:59'],
[1804,'2021-01-10 13:50:00','2021-01-10 14:20:00'],
[1805,'2021-01-10 18:10:00','2021-01-10 18:40:00'],
[1806,'2021-01-10 19:40:00','2021-01-10 20:10:00'],
[1807,'2021-01-10 21:25:00','2021-01-10 21:55:00'],
[1808,'2021-01-10 22:53:00','2021-01-10 23:23:00']]

from functools import reduce

df2 = pd.DataFrame(d2,columns=c2)

df2.start = pd.to_datetime(df2.start)
df2.end = pd.to_datetime(df2.end)

temp_list = []

for i, row in df2.iterrows():

    temp = df1[df1.Time.between(row.start,row.end)]

    x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()
    y = temp.groupby('ID')['Y'].agg(['min','max','mean','median']).add_prefix('Y_').reset_index()
    z = temp.groupby('ID')['Z'].agg(['min','max','mean','median']).add_prefix('Z_').reset_index()
    h = temp.groupby('ID')['H'].agg(['min','max','mean','median']).add_prefix('H_').reset_index()

    data_frames = [x,y,z,h]

    df_merged = reduce(lambda left,right: pd.merge(left,right,on=['ID'],
                            how='outer'), data_frames).fillna('void')

    df_merged['start'] = row.start
    df_merged['end'] = row.end
    
    temp_list.append(df_merged)


df_final = pd.concat(temp_list, ignore_index=True)

column_names = ['start','end','ID',
                    'X_min','X_max','X_mean','X_median',
                    'Y_min','Y_max','Y_mean','Y_median',
                    'Z_min','Z_max','Z_mean','Z_median',
                    'H_min','H_max','H_mean','H_median']

df_final = df_final[column_names]

print(df_final)

The output of this will be:

                start                 end  ID  ...  H_max   H_mean  H_median
0 2020-02-03 18:11:59 2020-02-03 18:42:00  01  ...  0.018  0.01500    0.0145
1 2020-02-03 18:11:59 2020-02-03 18:42:00  02  ...  0.028  0.02500    0.0245
2 2020-02-03 18:11:59 2020-02-03 18:42:00  03  ...  0.038  0.03500    0.0345
3 2020-02-03 18:11:59 2020-02-03 18:42:00  04  ...  0.048  0.04500    0.0445
4 2020-02-03 18:11:59 2020-02-03 18:42:00  05  ...  0.127  0.08675    0.0790
5 2020-02-03 19:36:59 2020-02-03 20:06:59  01  ...  0.018  0.01500    0.0145
6 2020-02-03 19:36:59 2020-02-03 20:06:59  02  ...  0.028  0.02500    0.0245
7 2020-02-03 19:36:59 2020-02-03 20:06:59  03  ...  0.038  0.03500    0.0345
8 2020-02-03 19:36:59 2020-02-03 20:06:59  04  ...  0.048  0.04500    0.0445
9 2020-02-03 19:36:59 2020-02-03 20:06:59  05  ...  0.127  0.08675    0.0790
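
As a side note, the four per-column groupby calls and the reduce merge inside the loop can be collapsed into a single agg call with a dict; this sketch shows an equivalent per-interval step under the same assumptions:

# Inside the loop, after temp = df1[df1.Time.between(row.start, row.end)]:
aggs = {c: ['min', 'max', 'mean', 'median'] for c in ['X', 'Y', 'Z', 'H']}
stats = temp.groupby('ID').agg(aggs)

# Flatten the column MultiIndex to X_min, X_max, ..., H_median.
stats.columns = ['_'.join(col) for col in stats.columns]
stats = stats.reset_index()

stats['start'] = row.start
stats['end'] = row.end
temp_list.append(stats)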
Joe Ferndz