I need to process large CSV files stored in an S3 bucket by dividing them into smaller chunks. However, this seems to be a task better suited to file-system storage than to object storage.
Hence, I am planning to download the large file locally, divide it into smaller chunks, and then upload the resulting files together to a different folder.
I am aware of the download_fileobj method, but could not determine whether it would result in an out-of-memory error while downloading large files of size ~10 GB.

aviral sanjay
4 Answers
I would recommend using download_file():
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
It will not run out of memory while downloading; boto3 manages the transfer process and streams the object to disk rather than holding it in memory.
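Once the file is on local disk, it can be split into smaller CSVs. Below is a minimal sketch using Python's stdlib csv module, which (unlike naive line splitting) correctly handles newlines embedded in quoted fields; the function name, paths, and chunk size are assumptions for illustration:

```python
import csv

def split_csv(src_path, rows_per_chunk, dest_prefix):
    """Split src_path into dest_prefix-000.csv, dest_prefix-001.csv, ..."""
    with open(src_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)  # repeat the header row in every chunk
        chunk_idx, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_chunk:
                _write_chunk(dest_prefix, chunk_idx, header, rows)
                chunk_idx, rows = chunk_idx + 1, []
        if rows:  # flush any remaining rows into a final, smaller chunk
            _write_chunk(dest_prefix, chunk_idx, header, rows)

def _write_chunk(prefix, idx, header, rows):
    with open(f'{prefix}-{idx:03d}.csv', 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

The resulting chunk files can then be uploaded back to a different S3 prefix, e.g. with the same client's upload_file().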

John Rotenstein
You can use the awscli command line for this and stream the output:
aws s3 cp s3://<bucket>/file.txt -
The above command streams the file contents to stdout. You can then pipe it into the split and/or tee commands to create file chunks without ever holding the whole file in memory.
Example: aws s3 cp s3://<bucket>/file.txt - | split -d -b 100000 -
More details in this answer: https://stackoverflow.com/a/7291791/2732674

Varun Chandak
- 943
- 1
- 8
- 25
0
You can increase bandwidth usage by making concurrent S3 API transfer calls:
import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')

# Allow up to 150 concurrent transfer threads
config = TransferConfig(max_concurrency=150)
s3_client.download_file(
    Bucket=s3_bucket,
    Filename='path',
    Key='key',
    Config=config,
)

Shady Smaoui
You can try the boto3 s3.Object API:
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('bucket_name', 'key')
body = obj.get()['Body']  # StreamingBody: iterating yields lines without loading the whole object
for line in body:
    print(line)

raghavyadav990
- That would cause trouble, as a single row in a CSV file can contain newline characters, which pandas can handle but streaming line by line cannot. – aviral sanjay Jan 16 '19 at 09:06
- I have never encountered such a scenario; I think it could handle that too. Try forming a CSV with this text: a,b C \n,d – raghavyadav990 Jan 19 '19 at 12:49
- Yes, I faced this issue, hence the experience stated above. The point to note is that row != line. – aviral sanjay Jan 19 '19 at 13:06
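The distinction raised in these comments (a CSV row is not a line) can be sketched with Python's stdlib csv module; the sample data below is made up for illustration:

```python
import csv
import io

# One CSV row whose quoted middle field contains a newline
data = 'a,"b\nC",d\n'

# Naive line-by-line streaming splits the row in two
lines = data.splitlines()

# A CSV-aware parser recovers the single row intact
rows = list(csv.reader(io.StringIO(data)))

print(len(lines))  # 2
print(rows)        # [['a', 'b\nC', 'd']]
```

This is why splitting on raw lines can corrupt rows that contain quoted newlines, while csv-aware chunking keeps them whole.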