I have 1024 parquet files, each 1 MB in size. I'm using Python Dask
to merge those 1024 files into a single file. I have plenty of disk space, but RAM is somewhat limited.
Is there an efficient way to solve this with Dask?
import dask.dataframe as dd
def generatePath():
    # Yield the path of each of the 1024 input files.
    for i in range(0, 1024):
        yield "data/2000-" + str(i) + ".parquet"

def readDF():
    # Read each file lazily as its own Dask DataFrame.
    paths = generatePath()
    for x in paths:
        df = dd.read_parquet(x, columns=['name', 'address'], engine='pyarrow')
        yield df

def mergeDF():
    # Concatenate all the per-file DataFrames, then materialize the result.
    allDF = readDF()
    df = next(allDF)
    for iter_DF in allDF:
        df = dd.concat([df, iter_DF])
    return df.compute()
Here is my code, and it throws memory errors. Correct me if I'm wrong about what happens under the hood: the code loads the files one by one, creates a DataFrame for each, and then concatenates them. In that case, shouldn't it only need a small amount of memory?
Is there another way to solve this?
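For what it's worth, this is the kind of alternative I was wondering about (just a sketch on my part; I'm assuming dd.read_parquet accepts a glob pattern and that to_parquet writes partitions to disk without a compute() call, and the output path is only a placeholder):

import dask.dataframe as dd

# Let Dask discover and read all 1024 files at once via a glob pattern,
# instead of concatenating them one by one.
df = dd.read_parquet("data/2000-*.parquet",
                     columns=['name', 'address'],
                     engine='pyarrow')

# Write straight to disk rather than calling compute(), so the full result
# is never pulled back into RAM. I assume repartition(npartitions=1) is what
# forces everything into a single output file, at the cost of building one
# ~1 GB partition in memory right before the write.
df.repartition(npartitions=1).to_parquet("merged_output/", engine='pyarrow')

Would something along these lines avoid the memory errors?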