
I have two CSV files in S3, one around 60 GB and the other around 70 GB. I need to load both CSV files into pandas dataframes and perform operations such as joins and merges on the data.

I have an EC2 instance with enough memory to hold both dataframes in memory at the same time.

What is the best way to read these huge files from S3 into pandas dataframes?

Also, after I perform the required operations on the dataframes, the output dataframe should be uploaded back to S3.

What is the best way of uploading the huge CSV file to S3?

user12073121
  • Did you already try `pd.read_csv('s3://foo/bar.csv')`? I am not entirely sure about the efficiency though. – Daan Klijn May 05 '20 at 20:15
  • Can you be more specific? Which part are you struggling with? – AMC May 05 '20 at 20:21
  • Related: https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk, https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas, https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – AMC May 05 '20 at 20:22
  • @AMC I wanted to know the best and most efficient way of loading and writing huge amounts of data between S3 and dataframes. – user12073121 May 05 '20 at 20:24
  • Does this answer your question? [How to import a text file on AWS S3 into pandas without writing to disk](https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk) – AMC May 05 '20 at 20:27
  • I'm currently doing it the same way, but I'm not sure whether that's the most efficient way of reading and processing such a huge amount of data. Or are there better ways of doing it? – user12073121 May 05 '20 at 20:31
  • @user12073121 What Mayank Porwal suggested should be alright. If there's something very wrong, you'll notice ;) – AMC May 05 '20 at 20:34
  • You may also want to consider [Amazon Athena](https://aws.amazon.com/athena/?nc=sn&loc=0), which might be cheaper and easier to set up. – howard.h May 06 '20 at 00:38
  • Yeah, but isn't Athena used more for ad-hoc querying than for working with large data? – user12073121 May 06 '20 at 04:09
  • AWSWrangler is a good option here. You can also look into different file formats, and compare their read/write time and size. CSV is pretty slow. Check this: https://stackoverflow.com/questions/60274660/fastest-option-for-reading-the-data-for-pandas-from-s3-bucket/76320538#76320538 – Rishabh Gupta May 24 '23 at 06:35
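
As the last comment suggests, AWS Data Wrangler (the `awswrangler` package) reads from and writes to S3 directly, and it also makes it easy to switch to a columnar format such as Parquet, which is typically much faster to read/write and far smaller than CSV at this scale. A minimal sketch, assuming `awswrangler` is installed and reusing the bucket/file names from the answer below as placeholders:

import awswrangler as wr

# Read the CSV straight from S3 into a pandas dataframe
df = wr.s3.read_csv('s3://bucket-name/file.csv')

# ... joins / merges ...

# Write the result back to S3 as Parquet instead of CSV
wr.s3.to_parquet(df=df, path='s3://bucket-name/output/', dataset=True)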

1 Answer


For reading from S3, you can do:

import pandas as pd

# Reading an s3:// path directly requires the s3fs package to be installed
df = pd.read_csv('s3://bucket-name/file.csv')

Then do all the joins and merges on this dataframe and upload it back to S3:

df.to_csv('s3://bucket-name/file.csv', index=False)
Mayank Porwal
  • I'm currently doing it this way, but is it the most efficient approach, or should we be reading in chunks and combining at the EC2 end? – user12073121 May 05 '20 at 20:26
  • Yes, this is quite efficient and straightforward. With earlier versions, the file had to be downloaded from S3 and then read into pandas, but now it can be read directly. Hard to get more efficient than this. – Mayank Porwal May 05 '20 at 20:29
  • Is pd.to_csv() better than doing a multipart upload to S3? – user12073121 May 05 '20 at 20:37
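
On the chunked-read and multipart-upload questions in the comments above: if memory ever becomes tight, `pd.read_csv` can read the file in chunks, and for the upload, boto3's `upload_file` switches to a multipart upload automatically for large files. A rough sketch, assuming a hypothetical local temp file and reusing the bucket name from the answer as a placeholder:

import boto3
import pandas as pd

# Read the CSV from S3 in manageable chunks (requires s3fs for the s3:// path)
chunks = pd.read_csv('s3://bucket-name/file.csv', chunksize=1_000_000)
df = pd.concat(chunks, ignore_index=True)

# ... joins / merges ...

# Write locally, then upload with boto3; upload_file performs a
# multipart upload under the hood once the file is large enough
df.to_csv('/tmp/output.csv', index=False)
boto3.client('s3').upload_file('/tmp/output.csv', 'bucket-name', 'output.csv')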