
I have a very big dataframe with this structure:

Timestamp    Val1

Here you can see a real sample:

    Timestamp           Temp         
0   1622471518.92911    36.443       
1   1622471525.034114   36.445       
2   1622471531.148139   37.447      
3   1622471537.284337   36.449      
4   1622471543.622588   43.345      
5   1622471549.734765   36.451      
6   1622471556.2518     36.454      
7   1622471562.361368   41.461     
8   1622471568.472718   42.468   
9   1622471574.826475   36.470

What I want to do is compare the Temp column with itself: if one value is higher than another by more than "X" (for example 4), and the time between them is less than "Y" (for example 180 min), then I want to save some data about both rows.

Right now I'm using two for loops, one inside the other, but this takes too much time, and pandas usually has a way to avoid that.

This is my code:

import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60  # convert minutes to seconds
temps = df['Temperature'].values
times = df['Timestamp'].values

results = []
for i in range(len(temps)):
    for j in range(i + 1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i]) * maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()

            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break

# Then I save it in a dataframe and perform other actions

Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it's what I want (or I don't know how to use it).

Thank you very much.

Lleims

1 Answer


Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:

import numpy as np

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]

condition = np.logical_and(temps_diff > maxim, 
                           times_diff < cap_time)

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j], 
                            times[i], times[j], 
                            times_diff[i, j]])
            
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
 [36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]

To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:

import numpy as np

n = len(temps)

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]

condition = np.logical_and(temps_diff > maxim, 
                           times_diff < cap_time)
            
results_full = np.stack([np.repeat(temps[:, None], n, axis=1), 
                         np.tile(temps, (n, 1)), 
                         np.repeat(times[:, None], n, axis=1), 
                         np.tile(times, (n, 1)), 
                         times_diff])

results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01,  4.33450000e+01,  1.62247152e+09,
         1.62247154e+09,  2.46934779e+01],
       ...
       [ 3.64540000e+01,  4.24680000e+01,  1.62247156e+09,
         1.62247157e+09,  1.22209179e+01],
       ... 
      ])

As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
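
If you want the fully vectorized result to match the loop version exactly, one possible refinement (a sketch of my own, not part of the answer above) is to keep only the upper triangle of the condition mask, i.e. only pairs with j > i:

import numpy as np

# Sketch: reuse `condition` and `results_full` from the code above and zero out
# the diagonal and the lower triangle, so only pairs with j > i remain.
condition_upper = np.triu(condition, k=1)

results = results_full[np.stack(results_full.shape[0] * [condition_upper])]
results.reshape((5, -1)).T   # one row per (i, j) pair, as in the loop version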

Arne
  • Looks great! But `times_diff = times - times[:, None]` is too much for my computer and crashes it: `MemoryError: Unable to allocate 13.6 GiB for an array with shape (42778, 42778) and data type float64`... – Lleims Jun 09 '21 at 07:26
  • Since you're only interested in pairwise comparisons where the time difference is less than `cap_time`, you could divide the data into batches, as long as you let them overlap by `cap_time`. – Arne Jun 09 '21 at 08:07
  • Hmm, I'm new to broadcasting, but I'm wondering: if I batch my data, won't I lose combinations between batches? This topic looks very interesting, but I'm not sure how to do it right now. Thank you very much! @Arne – Lleims Jun 09 '21 at 09:21
  • Assuming the data are sorted by time, say the last row in batch 1 has time `t_1`; if you let batch 2 start at time `t_1 - cap_time`, and so on, you won't lose any relevant combinations between batches (see the sketch below). – Arne Jun 09 '21 at 10:15
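
For reference, here is a rough sketch of that batching idea (my own illustration, not Arne's code). It assumes the dataframe is sorted by Timestamp and that `temps`, `times`, `maxim` and `cap_time` are defined as in the question; `batch_size` is a hypothetical value you would tune to your memory budget. Each block of `i` values is compared only against the `j` values whose timestamp lies within `cap_time` of the block, so the broadcasted arrays stay small and no pair is counted twice:

import numpy as np

# Rough sketch of the batching idea (assumes the data are sorted by Timestamp,
# and that temps, times, maxim and cap_time are defined as in the question).
batch_size = 2000  # hypothetical block size, tune to your memory budget
n = len(temps)
results = []

for start in range(0, n, batch_size):
    end = min(start + batch_size, n)
    # last j index that can still lie within cap_time of any i in this block
    j_max = np.searchsorted(times, times[end - 1] + cap_time)

    temps_diff = temps[start:j_max] - temps[start:end][:, None]   # shape (block, candidates)
    times_diff = times[start:j_max] - times[start:end][:, None]

    # keep only pairs with j > i, so no pair is reported twice across blocks
    j_after_i = np.arange(start, j_max) > np.arange(start, end)[:, None]
    condition = (temps_diff > maxim) & (times_diff < cap_time) & j_after_i

    for r, c in zip(*np.nonzero(condition)):
        results.append([temps[start + r], temps[start + c],
                        times[start + r], times[start + c],
                        times_diff[r, c]])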