I need to process large files stored in an S3 bucket by dividing each CSV file into smaller chunks. However, this seems to be a task better done on file-system storage than on object storage, so I am planning to download the large file locally, divide it into smaller chunks, and then upload the resulting files to a different folder. I am aware of the download_fileobj method, but could not determine whether it would cause an out-of-memory error when downloading large files of around 10 GB.

aviral sanjay

4 Answers

I would recommend using download_file():

import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')

It will not run out of memory while downloading. Boto3 will take care of the transfer process.
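To cover the full workflow in the question (download, split into chunks, re-upload), here is a minimal sketch. The `split_file` helper and the bucket/key/path names are my own placeholders, not part of this answer:

```python
import os

def split_file(path, chunk_size, out_dir):
    """Split a local file into numbered chunk files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            chunk_path = os.path.join(out_dir, "chunk-%05d" % index)
            with open(chunk_path, "wb") as dst:
                dst.write(data)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths

# The S3 legs of the workflow (hypothetical bucket/key names; requires credentials):
#
#   import boto3
#   s3 = boto3.client('s3')
#   s3.download_file('mybucket', 'big.csv', '/tmp/big.csv')
#   for chunk in split_file('/tmp/big.csv', 100 * 1024 * 1024, '/tmp/chunks'):
#       s3.upload_file(chunk, 'mybucket', 'chunks/' + os.path.basename(chunk))
```

Because the file is read and written in fixed-size pieces, memory use stays bounded by `chunk_size` regardless of the total file size.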

John Rotenstein
You can use the awscli command line for this. Stream the output as follows:

aws s3 cp s3://<bucket>/file.txt -

The above command will stream the file contents in the terminal. Then you can use split and/or tee commands to create file chunks.

Example: aws s3 cp s3://<bucket>/file.txt - | split -d -b 100000 -

More details in this answer: https://stackoverflow.com/a/7291791/2732674
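To see what `split -d -b` produces, here is a runnable sketch of the chunking step alone, using a locally generated sample file in place of the S3 stream (in practice the input would come from `aws s3 cp s3://<bucket>/file.txt -` as shown above; the file names are placeholders):

```shell
# Generate a 250-byte sample file, then split it into 100-byte numbered chunks.
head -c 250 /dev/zero > sample.bin
split -d -b 100 sample.bin chunk-

# This creates chunk-00 and chunk-01 (100 bytes each) and chunk-02 (50 bytes).
ls chunk-*
```

Each chunk could then be uploaded back with `aws s3 cp chunk-NN s3://<bucket>/chunks/`.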

Varun Chandak
You can increase bandwidth usage by making concurrent S3 API transfer calls:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')

# Allow up to 150 concurrent threads for the transfer
config = TransferConfig(max_concurrency=150)

s3_client.download_file(
    Bucket=s3_bucket,
    Filename='path',
    Key="key",
    Config=config
)
Shady Smaoui
You can try the boto3 s3.Object API.

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')

body = obj.get()['Body']  # Body is a StreamingBody; it can be read as a stream

for line in body.iter_lines():
    print(line)
  • That would cause trouble, as in CSV files a single row can sometimes contain newline characters, which pandas can take care of but line-by-line streaming cannot. – aviral sanjay Jan 16 '19 at 09:06
  • I've never encountered such a scenario; I think it could handle that too. Try forming a CSV with this text: a,b C \n,d – raghavyadav990 Jan 19 '19 at 12:49
  • Yeah, I faced this issue, hence the experience stated above. The point to note is that row != line. – aviral sanjay Jan 19 '19 at 13:06
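As the comments note, a CSV row can span multiple physical lines when a field contains a quoted newline, so iterating the streaming body line by line can split a row in half. A minimal illustration with the stdlib csv module, which applies the same quoting rules pandas' read_csv does (the sample data is my own):

```python
import csv
import io

# One logical CSV row whose second field contains a quoted newline,
# followed by a second row: three physical lines, but only two rows.
data = io.StringIO('a,"b\nc",d\n1,2,3\n')

rows = list(csv.reader(data))
# rows == [['a', 'b\nc', 'd'], ['1', '2', '3']]
```

A naive `for line in body` loop would see the first row as two broken fragments, which is why row != line matters here.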