I am working on an ETL project. As part of it, I am trying to read a .parquet file so I can inspect the data, transform it, and upload it.
I keep failing at that step: I always get an OOM (out-of-memory) error while reading the file.
Is there some way I could read this locally?
This is my code currently:
import os

import dask.dataframe as dd
from dask.distributed import Client


def main():
    client = Client()
    print(f"Dashboard link: {client.dashboard_link}")

    current_dir = os.getcwd()
    file_path = os.path.join(current_dir, "part-00000-4333534a-3d5-41162-8f14-ee4123233-e000.snappy.parquet")

    # .head(100) triggers computation of the first partition of the Dask dataframe
    ddf = dd.read_parquet(file_path, engine='pyarrow').head(100)
    print(f"THIS IS THE REDUCED FILE: \n\n{ddf}")

    client.close()


if __name__ == '__main__':
    main()
I've tried pandas and Dask, and the fastparquet engine too. Each file is 1.9 GB (I have 50 of them to process) and my PC has 8 GB of RAM.