Speed up pandas merge and for loop python

Question

I'm preprocessing data from large dataframe into multiple small numpy files according to x,y coordinates.

i.e. I have many pixelwise values saved in dataframe(P2), and I want to split it into smaller image like pieces (npy).

My steps are as follows.

Loop pieces of coordinates
Convert the piece of coordinates into a dataframe: P1
Merge P1 with the raw data:P2 by ["X", "Y"] in P2 as P_merge
Save P_merge as numpy files

However, since there are many pieces of coordinates (about 1000) and P2 is pretty large (38976147 rows * 13 columns), the code runs really slow.

I've try to run the for loop by multiprocess but I failed on too many global variables here.

I also tried to use dask. However, since I need to save into npys, it would stuck at converting dask dataframe into numpy.

What's the proper way to speed up for loop or panda merge here? Thank you.

def split_2_pieces(img_coors: List[List[float]], P2, img_h, img_w, file_name):
    # Step1
    for (x, y) in img_coors:
        # Step2
        X, Y = np.meshgrid(x, y) # x is a list, y is a list
        P1 = {"X": X.flatten(), "Y": Y.flatten()}
        P1 = pd.DataFrame(data=P1)
        
        # Step3
        P_merge = P1.merge(P2[["X", "Y", "value1", ...]], how="left", on=["X", "Y"]).fillna(0)
        
        # Step4
        P_merge_npy = np.array(P_merge).reshape(img_h, img_w, 13)
        np.save(Path(path, str(idx) + file_name), P_merge_npy)

The reason I need to meshgrid in step2 is that some datapoint does not appear in P2.

In order to save pixelwise map as npy files, I first generate x, y then meshgrid, and then fill in the data of P2 by pandas.merge according to "X", "Y".

The P2 is a dataframe: something like this.

X	Y	Name	value	power1	...	power9
1071.564	2478.21	u_mcuss	0.0072	0.01	...	0.023

What does your dataframe look like, can you share a couple of rows from it? — BeRT2me, Jun 01 '22 at 07:21
@BeRT2me Hi, I'm sorry that I could not screen shot my dataframe. I edit the P2 dataframe from scratch above. — domokun0413, Jun 01 '22 at 08:03
Merging on index used to be faster, I would say try that out using X and Y as index for both P1 and P2 — Zaero Divide, Jun 01 '22 at 15:51

score 0 · Answer 1 · answered Jun 01 '22 at 09:26

According to the method by enter link description here Solution 2, I first extract the rows from P2 that really matters by "np.where".

By doing this, the merge step will be speeded up since P2 have left much less rows.

def split_2_pieces(img_coors: List[List[float]], P2, img_h, img_w, file_name):
    # Step1
    for (x, y) in img_coors:
        # Step2
        X, Y = np.meshgrid(x, y) # x is a list, y is a list
        P1 = {"X": X.flatten(), "Y": Y.flatten()}
        P1 = pd.DataFrame(data=P1)
        
        # New step 
        x_min, x_max = x.min(), x.max()
        y_min, y_max = y.min(), y.max()
        df_X, df_Y = P2.X.values, P2.Y.values

        row_idx = np.where((df_X >= x_min) & (df_X <= x_max) & df_Y >= y_min) & (df_Y <= y_max))
        P2_extracted = P2.iloc[row_idx]
        
        # Step3 (Modified)
        P_merge = P1.merge(P2_extracted [["X", "Y", "value1", ...]], how="left", on=["X","Y"]).fillna(0)
    
        # Step4
        P_merge_npy = np.array(P_merge).reshape(img_h, img_w, 13)
        np.save(Path(path, str(idx) + file_name), P_merge_npy)

The original run time reduces from 18.7 seconds to 1.13 seconds!

Speed up pandas merge and for loop python

1 Answers1