
I'm preprocessing data from a large dataframe into multiple small NumPy files according to x, y coordinates.

That is, I have many pixelwise values saved in a dataframe (P2), and I want to split it into smaller image-like pieces (.npy).

My steps are as follows.

  1. Loop over the pieces of coordinates
  2. Convert each piece of coordinates into a dataframe: P1
  3. Merge P1 with the raw data P2 on ["X", "Y"] as P_merge
  4. Save P_merge as a NumPy file

However, since there are many pieces of coordinates (about 1000) and P2 is pretty large (38976147 rows * 13 columns), the code runs really slowly.

I've tried to run the for loop with multiprocessing, but I failed because of the many global variables involved.

I also tried Dask. However, since I need to save to .npy files, it got stuck converting the Dask dataframe into a NumPy array.

What's the proper way to speed up the for loop or the pandas merge here? Thank you.

from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

def split_2_pieces(img_coors: List[List[float]], P2, img_h, img_w, file_name):
    # Step 1: loop over the pieces of coordinates
    for idx, (x, y) in enumerate(img_coors):
        # Step 2: build the full pixel grid for this piece
        X, Y = np.meshgrid(x, y)  # x is a list, y is a list
        P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})

        # Step 3: fill in P2's values; grid points missing from P2 become 0
        P_merge = P1.merge(P2[["X", "Y", "value1", ...]], how="left", on=["X", "Y"]).fillna(0)

        # Step 4: save as (img_h, img_w, 13); `path` is a global output directory
        P_merge_npy = np.array(P_merge).reshape(img_h, img_w, 13)
        np.save(Path(path, str(idx) + file_name), P_merge_npy)

The reason I need meshgrid in step 2 is that some data points do not appear in P2.

In order to save the pixelwise map as .npy files, I first generate x and y, then meshgrid them, and then fill in the data from P2 with pandas.merge on "X" and "Y".
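As a toy illustration of that fill-in behavior (values here are made up): a coordinate pair missing from P2 ends up as a zero row after the left merge:

```python
import numpy as np
import pandas as pd

# P2 only contains one of the four grid points.
P2 = pd.DataFrame({"X": [1.0], "Y": [2.0], "value1": [0.5]})

x, y = [1.0, 3.0], [2.0, 4.0]
X, Y = np.meshgrid(x, y)
P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})

# Left merge keeps every grid point; missing ones become NaN, then 0.
P_merge = P1.merge(P2, how="left", on=["X", "Y"]).fillna(0)
print(P_merge["value1"].tolist())  # → [0.5, 0.0, 0.0, 0.0]
```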

P2 is a dataframe, something like this:

       X        Y     Name   value  power1 ... power9
1071.564  2478.21  u_mcuss  0.0072    0.01 ...  0.023

1 Answer


Following Solution 2 of the linked method, I first extract the rows from P2 that really matter, using np.where.

By doing this, the merge step is sped up, since the filtered P2 has far fewer rows.

from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

def split_2_pieces(img_coors: List[List[float]], P2, img_h, img_w, file_name):
    # Step 1
    for idx, (x, y) in enumerate(img_coors):
        # Step 2
        X, Y = np.meshgrid(x, y)  # x is a list, y is a list
        P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})

        # New step: keep only the rows of P2 that fall inside this piece
        x_min, x_max = min(x), max(x)
        y_min, y_max = min(y), max(y)
        df_X, df_Y = P2.X.values, P2.Y.values

        row_idx = np.where((df_X >= x_min) & (df_X <= x_max) &
                           (df_Y >= y_min) & (df_Y <= y_max))[0]
        P2_extracted = P2.iloc[row_idx]

        # Step 3 (modified): merge against the much smaller P2_extracted
        P_merge = P1.merge(P2_extracted[["X", "Y", "value1", ...]], how="left", on=["X", "Y"]).fillna(0)

        # Step 4: `path` is a global output directory
        P_merge_npy = np.array(P_merge).reshape(img_h, img_w, 13)
        np.save(Path(path, str(idx) + file_name), P_merge_npy)

The original run time dropped from 18.7 seconds to 1.13 seconds!
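As a side note, the same row filter can be written with a boolean mask directly, without np.where (a minor stylistic alternative with the same effect; toy data below, not part of the original timing):

```python
import pandas as pd

P2 = pd.DataFrame({"X": [0.5, 1.5, 2.5],
                   "Y": [0.5, 1.5, 2.5],
                   "value1": [1, 2, 3]})
x_min, x_max, y_min, y_max = 1.0, 3.0, 1.0, 3.0

# Boolean mask: True for rows inside the piece's bounding box.
mask = (P2.X >= x_min) & (P2.X <= x_max) & (P2.Y >= y_min) & (P2.Y <= y_max)
P2_extracted = P2[mask]
print(len(P2_extracted))  # → 2
```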