
I have 1024 parquet files, each about 1 MB in size. I'm using Python Dask to merge those 1024 files into a single file. I have plenty of disk space, but RAM is somewhat limited.

Is there an efficient way to solve this using Python Dask?

import dask.dataframe as dd
def generatePath():
    for i in range(0, 1024):
        return "data/2000-" + str(i) + ".parquet"

def readDF():
    paths = generatePath()
    for x in paths:
        df = dd.read_parquet(x, columns=['name', 'address'], engine='pyarrow')
        yield df

def mergeDF():
    allDF = readDF()
    df = next(allDF)
    for iter_DF in allDF:
        df = dd.concat([df,iter_DF])
    return df.compute()

Here is my code, and it throws memory errors. Correct me if I'm wrong about what happens under the hood: the code loads the files one by one, creates a DataFrame for each, and then concatenates them. In that case, shouldn't it avoid requiring a lot of memory?

Is there any other way to solve this?

edesz
Learnis
  • This question is about concatenating parquet files, using Dask, into a single parquet file - see [comment](https://stackoverflow.com/questions/61759297/dask-dataframe-concatenating-parquet-files-throws-out-of-memory#comment109267906_61774968). – edesz May 14 '20 at 10:34
  • your generatePath function is returning instead of yielding – adir abargil Apr 02 '23 at 09:44

1 Answer


Updated answer

To read and combine multiple files into a single `.parquet` file, try `.repartition(npartitions=1)` - see this SO post.

# Read all files in `data/`
df = dd.read_parquet("data/", columns=['name', 'address'], engine='pyarrow')

# Export to single `.parquet` file
df.repartition(npartitions=1).to_parquet("data/combined", write_metadata_file=False)

This will combine all the files in `data/` into a single file:

$ ls data/combined
part.0.parquet
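
If you want to sanity-check the result, here is a minimal sketch (assuming the combined output lives under data/combined, as above) that reads it back and confirms it is a single partition:

import dask.dataframe as dd

# Read the combined output back lazily
combined = dd.read_parquet("data/combined", columns=['name', 'address'], engine='pyarrow')

print(combined.npartitions)  # expect 1
print(len(combined))         # total row count across the original 1024 files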

Note: There are benefits to using multiple parquet files (for example, Dask can read and process them in parallel).
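
If you would rather keep multiple output files but control how large each one is, a sketch of one option (the ~100MB target below is just an example) is to repartition by size instead of down to a single partition:

import dask.dataframe as dd

df = dd.read_parquet("data/", columns=['name', 'address'], engine='pyarrow')

# Target roughly 100MB of in-memory data per partition (one output file per partition)
df = df.repartition(partition_size="100MB")
df.to_parquet("data/combined_chunks", write_metadata_file=False)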

Old answer

There is no need to call .compute() just to read the data: computing materializes the entire concatenated DataFrame in memory, which will quickly fill up your RAM and is likely what causes your memory error. You can use dd.read_parquet and point it at the data/ folder directly:

df = dd.read_parquet("data/", columns=['name', 'address'], engine='pyarrow')
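
From there, you can write the result without ever calling .compute(), since to_parquet streams the data partition by partition; a minimal sketch (the output directory data/combined_lazy is just an example, and this writes one part file per input partition unless you repartition first, as in the updated answer):

# Lazily write the combined data; nothing is materialized in memory all at once
df.to_parquet("data/combined_lazy", write_metadata_file=False)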
edesz
  • Calling .compute() on every iteration was an indentation mistake; I'm calling it only once. My data is actually in multiple directories; I just wrote it that way for the example. The ultimate aim is to write everything into a single parquet file. I'm thinking of trying to stream the data by limiting the size, and I'll try to write it that way. – Learnis May 13 '20 at 12:59
  • Thanks for the clarification. IIUC, you want to combine all the files into a single `.parquet` file. See my updated answer. – edesz May 13 '20 at 15:19