I need to find the data in between certain triggers in a data frame that contains several real-time signals and one timestamp column. The triggers are stored in a separate data frame. I have a working solution with a for-loop, but it takes ages on the relatively large dataset I am using. I am trying to speed things up with list comprehensions, but I cannot find a working solution. Does anyone have an idea of how to do this faster?

The for-loop solution:

This is what the trigger data frame looks like:

      timestamp_x  start_trig            id        timestamp_y  end_trig
0   1592724037612         1.0            12      1592724068408      -1.0
1   1592724459283         1.0            23      1592724490290      -1.0
2   1592724514246         1.0            17      1592724545450      -1.0

And this is the code that appends the matching rows from the old data frame `data` to the new data frame `new_df`:

for i in range(len(df_trig)):
    # Select the rows of `data` whose timestamp falls inside the i-th trigger window
    mask = data[(data['timestamp'] > df_trig.iloc[i].timestamp_x) & (data['timestamp'] < df_trig.iloc[i].timestamp_y)]
    new_df = new_df.append(mask)

This is what I tried in order to solve it with a generator expression, which does not work for me:

new_df = pd.DataFrame(data[(data['timestamp'] > low_lim) & (data['timestamp'] < upp_lim)] for low_lim,upp_lim in zip(df_trig['timestamp_x'], df_trig['timestamp_y']))

Thank you in advance for your answers!

EDIT:

The original data frame looks as follows:

        timestamp          id        param1
0   1592724037612          23          56.1
1   1592724037712          23          56.1
2   1592724037812          23          56.0
...
100 1592724047612          17          54.7

The objective is to extract all rows that lie in between any of the trigger pairs from the other data frame. The new data frame therefore looks almost identical, except that the rows outside the ranges defined in the trigger data frame are missing.

mr_froe

2 Answers


I found a solution based on a list comprehension that sped up the processing:

# Build one filtered data frame per trigger window, then concatenate them all at once
list_cycles = [data[(data['timestamp'] > low_lim) & (data['timestamp'] < upp_lim)]
               for low_lim, upp_lim in zip(df_trig['timestamp_x'], df_trig['timestamp_y'])]
new_df = pd.concat(list_cycles, ignore_index=True)
mr_froe
  • From the Pandas docs: "Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html – jsmart Aug 08 '20 at 13:12
  • @jsmart Thanks for your advice. "list_cycles" is a list of data frames, each carrying the rows for one pair of start and end triggers. During my research here I found a thread saying that pd.concat is the most efficient way to combine all the data frames in the list into one. You suggest appending all the rows from the separate data frames to a list and then generating a new data frame from that list, right? Whether this yields a performance improvement, I may figure out. – mr_froe Aug 09 '20 at 04:19
  • Yes, that's correct (i.e., the list comprehension builds a list of data frames, and concatenates it at the end). That will be faster than repeatedly appending to a data frame each time through the list. Please consider posting your performance results. – jsmart Aug 09 '20 at 13:09
  • Try changing list_cycles to a 'generator comprehension', like this: `list_cycles = (data[(data['timestamp'] > low_lim) & (data['timestamp'] < upp_lim)] for low_lim,upp_lim in zip(df_trig['timestamp_x'], df_trig['timestamp_y']))` -- the right-hand side is enclosed in parens, not square brackets. Reason: lower memory usage. – jsmart Aug 09 '20 at 13:11
  • After changing from the loop to the generator expression and several other minor improvements, I managed to run the code in 2 minutes, which was running previously for 1.5 hours. I don't know why the improvement is so crazy. – mr_froe Aug 12 '20 at 14:11

I wrote some code without a for-loop. As far as I know, it is faster to use the built-in functions of the pandas library than to use a for-loop, but I'm not sure this is fast enough in your case.

  1. Set up

(images: the example `trig` and `data` data frames)
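The screenshots are missing, so here is a minimal reconstruction of the two example frames, using the values shown in the question (the exact frames behind the lost images are assumed to match):

    import pandas as pd

    # Trigger frame, reconstructed from the table in the question
    trig = pd.DataFrame({
        'timestamp_x': [1592724037612, 1592724459283, 1592724514246],
        'start_trig': [1.0, 1.0, 1.0],
        'id': [12, 23, 17],
        'timestamp_y': [1592724068408, 1592724490290, 1592724545450],
        'end_trig': [-1.0, -1.0, -1.0],
    })

    # Signal frame, reconstructed from the example rows in the question
    data = pd.DataFrame({
        'timestamp': [1592724037612, 1592724037712, 1592724037812, 1592724047612],
        'id': [23, 23, 23, 17],
        'param1': [56.1, 56.1, 56.0, 54.7],
    })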

  2. Merge `trig` into `data` Cartesian-product style, so that every row of `data` is paired with every trigger row. I found a trick on Stack Overflow to achieve this, which uses a temporary key.

    data.assign(key=1).merge(trig.assign(key=1)).drop(columns='key')

    and it yields

(image: the cross-joined data frame)
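If your pandas version is 1.2 or newer, the same cross join is available directly and the temporary key becomes unnecessary:

    # pandas >= 1.2: built-in cross join, equivalent to assign(key=1).merge(...).drop(columns='key')
    merged = data.merge(trig, how='cross')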

  3. Keep only the rows that fall inside a trigger window.

    data.assign(key=1).merge(trig.assign(key=1)).drop(columns='key').query('timestamp_x < timestamp < timestamp_y')

    and it yields

(image: the filtered data frame)

sangmoo
  • @sangmoo Thank you for your answer. Though this works for small datasets and just a few trigger signals, it is not applicable for me, as the trigger data frame contains almost 300 rows and the original dataset consists of 1,000,000 rows. – mr_froe Aug 07 '20 at 16:56
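At this scale the full cross join (300 × 1,000,000 rows) is unlikely to fit in memory. A minimal sketch of an alternative, assuming the trigger windows in `df_trig` are sorted by `timestamp_x` and do not overlap, maps each timestamp to its candidate window with `numpy.searchsorted` instead of materializing the product:

    import numpy as np

    # Assumption: df_trig is sorted by timestamp_x and its windows do not overlap
    starts = df_trig['timestamp_x'].to_numpy()
    ends = df_trig['timestamp_y'].to_numpy()
    ts = data['timestamp'].to_numpy()

    # For each timestamp, index of the last window starting at or before it (-1 if none)
    idx = np.searchsorted(starts, ts, side='right') - 1
    safe = idx.clip(min=0)  # avoid negative indexing; rows with idx == -1 are masked out below

    # Keep rows strictly inside their candidate window (same > and < as in the question)
    inside = (idx >= 0) & (ts > starts[safe]) & (ts < ends[safe])
    new_df = data[inside]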