
I am working on an ETL project, for which I am trying to read a .parquet file so I can inspect the data, transform it, and upload it.
I've been failing at that: I always get an OOM (out-of-memory) error while reading the file.

Is there some way I could read this locally?

This is my code currently:

import dask.dataframe as dd
import os
from dask.distributed import Client

def main():

    client = Client() 
    print(f"Dashboard link: {client.dashboard_link}")

    current_dir = os.getcwd()
    file_path = os.path.join(current_dir, "part-00000-4333534a-3d5-41162-8f14-ee4123233-e000.snappy.parquet")

    ddf = dd.read_parquet(file_path, engine='pyarrow').head(100)

    print(f"THIS IS THE REDUCED FILE: \n\n{ddf}")

    client.close()

if __name__ == '__main__':
    main()

I've tried both pandas and dask, and the fastparquet engine too. The file is 1.9 GB (I have 50 of them to process) and my PC has 8 GB of RAM.
