Optimizing Iterations on large datsets

Question

I have two large datasets df1 and df2, both have a column that records the time each observation was made. I want to find the time difference between every entry of df1 and every entry of df2.

The code below works but runs into memory errors when I attempt to run it on the entire datasets. How can I optimize this for memory efficiency?

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

LINE_NUMBER_table0 = [ ] # Initialize an empty list where we will add the number of row of table0
LINE_NUMBER_table1 = [ ] # Initialize an empty list where we will add the number of row of table1
TIME_DIFFERENCE = [ ] # Initialize an empty list where we will add the time difference between the row i of table0 and the row j of tabele1

for i in range(1000) :
    for j in range(1000) :
        LINE_NUMBER_table0.append(i) # Add the number of row i of table0
        LINE_NUMBER_table1.append(j) # Add the number of row j of table1 
        timedifference = df1["mjd"][i] - df2["MJD"][j] # Calculate the time difference between row i and row j
        TIME_DIFFERENCE.append(timedifference) # Add this time difference to the list TIME_DIFFERENCE

score 0 · Answer 1 · answered May 08 '22 at 17:55

You do not need a loop for that. Python loops are generally inefficient (especially iterating on Pandas dataframes, see this post). You need to use vectorized calls instead. For example, Numpy functions or the ones of Pandas. In this case, you can use np.tile and np.repeat. Here is an (untested) example:

import numpy as np

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

tmp = np.arange(1000)
LINE_NUMBER_table0 = np.repeat(tmp, 1000)
LINE_NUMBER_table1 = np.tile(tmp, 1000)

df1_mjd = np.repeat(df1["mjd"].to_numpy(), 1000)
df2_MJD = np.tile(df2["MJD"].to_numpy(), 1000)
TIME_DIFFERENCE = df1_mjd - df2_MJD

Note that you can convert Numpy array back to list using your_array.tolist() but it is better to work with Numpy array for sake of performance (note that Pandas uses Numpy array internally so the conversion between Pandas datafram and Numpy array is cheap as opposed to lists).

Optimizing Iterations on large datsets

1 Answers1