What is the most efficient way to insert large amounts of data generated in a Python script?
I am retrieving .grib files containing weather parameters from several sources. These GRIB files consist of grid-based data with shape (1201, 2400, 80), which results in a large amount of data.
I have written a script in which each value is combined with the bounds of its latitude/longitude cell, resulting in a data structure like the following (a rough sketch of the pairing follows the table):
+--------------------+-------+-------+--------+--------+
| value|lat_min|lat_max| lon_min| lon_max|
+--------------------+-------+-------+--------+--------+
| 0.0011200|-90.075|-89.925|-180.075|-179.925|
| 0.0016125|-90.075|-89.925|-179.925|-179.775|
+--------------------+-------+-------+--------+--------+
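The pairing is done roughly like this; a simplified sketch for a single time step, where the 0.15-degree spacing and the array names are placeholders inferred from the rows above:

```python
import numpy as np

# `field` stands in for one time step of the real GRIB data (1201 x 2400 values).
field = np.random.rand(1201, 2400)

lat_edges = np.linspace(-90.075, 90.075, 1202)    # 1201 latitude cells, 0.15 deg wide
lon_edges = np.linspace(-180.075, 179.925, 2401)  # 2400 longitude cells, 0.15 deg wide

# One (value, lat_min, lat_max, lon_min, lon_max) tuple per grid cell,
# ~2.9M tuples for a single time step.
rows = [
    (float(field[i, j]),
     float(lat_edges[i]), float(lat_edges[i + 1]),
     float(lon_edges[j]), float(lon_edges[j + 1]))
    for i in range(field.shape[0])
    for j in range(field.shape[1])
]
```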
I have tried looping through each of the 80 time steps and creating a PySpark DataFrame per step, as well as reshaping the whole array into shape (230592000,), but both methods either take ages to complete or exhaust the cluster's memory.
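For concreteness, the per-time-step attempt looks roughly like this (array sizes shrunk so the sketch runs quickly; the real array is (1201, 2400, 80), so each slice materialises ~2.9M rows on the driver before Spark sees them):

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the real (1201, 2400, 80) array.
data = np.random.rand(4, 6, 3)
lat_edges = np.linspace(-90.075, 90.075, data.shape[0] + 1)
lon_edges = np.linspace(-180.075, 179.925, data.shape[1] + 1)
columns = ["value", "lat_min", "lat_max", "lon_min", "lon_max"]

df = None
for t in range(data.shape[2]):
    # Build all rows for this time step on the driver, then hand them to Spark.
    rows = [
        (float(data[i, j, t]),
         float(lat_edges[i]), float(lat_edges[i + 1]),
         float(lon_edges[j]), float(lon_edges[j + 1]))
        for i in range(data.shape[0])
        for j in range(data.shape[1])
    ]
    step_df = spark.createDataFrame(rows, columns)
    df = step_df if df is None else df.union(step_df)  # 80 unions in the real case
```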
I have just discovered Resilient Distributed Datasets (RDDs), and I am able to use the map function to create all 230M entries in RDD form, but converting the RDD to a DataFrame or writing it to a file is again very slow.
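The RDD version looks roughly like this (again with toy sizes; the index mapping and the output path are placeholders): parallelize the flat index range and map each index back to its row, which works, but the toDF/write step is the slow part.

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Tiny stand-in for the real (1201, 2400, 80) array.
data = np.random.rand(4, 6, 3)
n_lat, n_lon, n_t = data.shape
lat_edges = np.linspace(-90.075, 90.075, n_lat + 1)
lon_edges = np.linspace(-180.075, 179.925, n_lon + 1)

def to_row(k):
    # Map a flat index back to (lat index, lon index, time index).
    i, rest = divmod(k, n_lon * n_t)
    j, t = divmod(rest, n_t)
    return (float(data[i, j, t]),
            float(lat_edges[i]), float(lat_edges[i + 1]),
            float(lon_edges[j]), float(lon_edges[j + 1]))

rdd = sc.parallelize(range(n_lat * n_lon * n_t)).map(to_row)

# Both of these are where it gets very slow with the full 230M rows.
df = rdd.toDF(["value", "lat_min", "lat_max", "lon_min", "lon_max"])
df.write.mode("overwrite").parquet("/tmp/grib_rows")  # placeholder output path
```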
Is there a way to multithread/distribute/optimize this so that it is both efficient and doesn't require large amounts of memory?
Thanks in advance!