I am working on an ETL project. As part of it, I am trying to read a .parquet file so I can inspect the data, transform it, and upload it.
I keep failing at that step: I always get an OOM (out-of-memory) error while reading the file.
Is there some way I could read this locally?
This is my code currently:
import os

import dask.dataframe as dd
from dask.distributed import Client


def main():
    client = Client()
    print(f"Dashboard link: {client.dashboard_link}")

    current_dir = os.getcwd()
    file_path = os.path.join(current_dir, "part-00000-4333534a-3d5-41162-8f14-ee4123233-e000.snappy.parquet")

    # .head(100) triggers computation of the first partition of the Dask dataframe
    ddf = dd.read_parquet(file_path, engine='pyarrow').head(100)
    print(f"THIS IS THE REDUCED FILE: \n\n{ddf}")

    client.close()


if __name__ == '__main__':
    main()
I've tried pandas and Dask, and the fastparquet engine too. Each file is 1.9 GB (I have 50 of them to process) and my PC has 8 GB of RAM.