I have two large datasets df1
and df2
, both have a column that records the time each observation was made. I want to find the time difference between every entry of df1
and every entry of df2
.
The code below works but runs into memory errors when I attempt to run it on the entire datasets. How can I optimize this for memory efficiency?
df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")
LINE_NUMBER_table0 = [ ] # Initialize an empty list where we will add the number of row of table0
LINE_NUMBER_table1 = [ ] # Initialize an empty list where we will add the number of row of table1
TIME_DIFFERENCE = [ ] # Initialize an empty list where we will add the time difference between the row i of table0 and the row j of tabele1
for i in range(1000) :
for j in range(1000) :
LINE_NUMBER_table0.append(i) # Add the number of row i of table0
LINE_NUMBER_table1.append(j) # Add the number of row j of table1
timedifference = df1["mjd"][i] - df2["MJD"][j] # Calculate the time difference between row i and row j
TIME_DIFFERENCE.append(timedifference) # Add this time difference to the list TIME_DIFFERENCE