I'm preprocessing data from a large dataframe into multiple small numpy files according to x, y coordinates.
That is, I have many pixelwise values stored in a dataframe (P2), and I want to split them into smaller, image-like pieces (.npy files).
My steps are as follows (implemented in `split_2_pieces` below):

- Loop over the pieces of coordinates
- Convert each piece of coordinates into a dataframe: P1
- Merge P1 with the raw data P2 on ["X", "Y"] as P_merge
- Save P_merge as a numpy file
However, since there are many pieces of coordinates (about 1,000) and P2 is pretty large (38,976,147 rows × 13 columns), the code runs really slowly.
I've tried running the for loop with multiprocessing, but I failed because of the many global variables involved.
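For reference, a simplified sketch of the shape my multiprocessing attempt took (`process_one` and `split_parallel` are placeholder names), rewritten here to pass the shared arguments explicitly rather than as globals; note that P2 then has to be pickled out to the workers:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path

import numpy as np
import pandas as pd

def process_one(piece, P2, img_h, img_w, path, file_name):
    # Steps 2-4 for a single (idx, (x, y)) piece (column selection omitted for brevity).
    idx, (x, y) = piece
    X, Y = np.meshgrid(x, y)
    P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})
    P_merge = P1.merge(P2, how="left", on=["X", "Y"]).fillna(0)
    np.save(Path(path, str(idx) + file_name),
            P_merge.to_numpy().reshape(img_h, img_w, -1))

def split_parallel(img_coors, P2, img_h, img_w, path, file_name):
    worker = partial(process_one, P2=P2, img_h=img_h, img_w=img_w,
                     path=path, file_name=file_name)
    with ProcessPoolExecutor() as ex:
        # P2 gets pickled and shipped to the worker processes, which is
        # itself expensive for a ~39M-row dataframe.
        list(ex.map(worker, enumerate(img_coors)))
```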
I also tried to use dask. However, since I need to save to .npy files, it gets stuck converting the dask dataframe into a numpy array.
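Roughly what the dask variant looked like (simplified; the partition count is an arbitrary guess, and `P2`, `img_coors`, `img_h`, `img_w` are as in the question):

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

P2_dd = dd.from_pandas(P2, npartitions=16)

for idx, (x, y) in enumerate(img_coors):
    X, Y = np.meshgrid(x, y)
    P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})
    # Building the merge graph is lazy and cheap...
    P_merge = dd.from_pandas(P1, npartitions=1).merge(
        P2_dd, how="left", on=["X", "Y"]).fillna(0)
    # ...but compute() is where it stalls: the whole merge actually runs
    # here, once per piece, just to get a numpy array back.
    arr = P_merge.compute().to_numpy().reshape(img_h, img_w, -1)
```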
What's the proper way to speed up the for loop or the pandas merge here? Thank you.
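To make the question concrete, this is the kind of alternative I've been wondering about: index P2 once on ["X", "Y"] and turn each piece into a `reindex` lookup instead of a fresh merge (untested sketch; `piece_to_npy` is a made-up name, and it assumes the (X, Y) pairs in P2 are unique). Would something along these lines be the right direction?

```python
import numpy as np
import pandas as pd

# Untested idea: pay the indexing cost once, so each piece becomes a
# plain index lookup instead of a hash merge against all ~39M rows.
P2_idx = P2.set_index(["X", "Y"]).sort_index()

def piece_to_npy(x, y, img_h, img_w):
    X, Y = np.meshgrid(x, y)
    grid = pd.MultiIndex.from_arrays([X.flatten(), Y.flatten()],
                                     names=["X", "Y"])
    # reindex keeps the full grid and leaves NaN where P2 has no pixel
    piece = P2_idx.reindex(grid).fillna(0)
    # reset_index brings X and Y back as columns, as in the merge version
    return piece.reset_index().to_numpy().reshape(img_h, img_w, -1)
```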
```python
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd

def split_2_pieces(img_coors: List[List[float]], P2, img_h, img_w, path, file_name):
    # Step 1: loop over the pieces of coordinates
    for idx, (x, y) in enumerate(img_coors):
        # Step 2: build the full pixel grid for this piece (x and y are lists)
        X, Y = np.meshgrid(x, y)
        P1 = pd.DataFrame(data={"X": X.flatten(), "Y": Y.flatten()})
        # Step 3: fill in the values from P2; grid points missing from P2 become 0
        P_merge = P1.merge(P2[["X", "Y", "value1", ...]], how="left", on=["X", "Y"]).fillna(0)
        # Step 4: reshape to (img_h, img_w, 13) and save one .npy per piece
        P_merge_npy = np.array(P_merge).reshape(img_h, img_w, 13)
        np.save(Path(path, str(idx) + file_name), P_merge_npy)
```
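A typical call looks like this (the sizes and paths here are placeholders, not my real values):

```python
split_2_pieces(img_coors, P2, img_h=256, img_w=256,
               path="./pieces", file_name="_piece.npy")
```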
The reason I need the meshgrid in step 2 is that some data points do not appear in P2 at all.
To save a pixelwise map as a .npy file, I first generate x and y, meshgrid them, and then fill in the data from P2 via pandas.merge on "X" and "Y".
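A tiny self-contained illustration of steps 2-3 (made-up values): the meshgrid keeps every pixel of the piece, and the left merge plus `fillna(0)` zero-fills the ones missing from P2.

```python
import numpy as np
import pandas as pd

x, y = [0.0, 1.0], [0.0, 1.0]            # a 2x2 piece
X, Y = np.meshgrid(x, y)
P1 = pd.DataFrame({"X": X.flatten(), "Y": Y.flatten()})

# Pretend P2 only has a value for one of the four pixels.
P2_small = pd.DataFrame({"X": [0.0], "Y": [1.0], "value": [0.5]})

print(P1.merge(P2_small, how="left", on=["X", "Y"]).fillna(0))
# Pixel (0.0, 1.0) keeps 0.5; the other three grid points get 0.
```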
P2 is a dataframe that looks something like this:

| X | Y | Name | value | power1 | ... | power9 |
|---|---|---|---|---|---|---|
| 1071.564 | 2478.21 | u_mcuss | 0.0072 | 0.01 | ... | 0.023 |