Is it possible to use readlines with boto3?

Question

I'm trying to run a diff on two files that are stored in S3, and would like to avoid downloading the files if possible.

The sample code I am working with is as follows:

import difflib

file1 = open('sample1.csv', 'r')
file2 = open('sample2.csv', 'r')

diff = difflib.ndiff(file1.readlines(), file2.readlines())

I see that with the boto3 package I can open a file from S3, but how can I pass the equivalent of file1.readlines() and file2.readlines() into the ndiff function?

Won't that involve storing in memory? My concern is when the file is 5GB, I won't be able to run a diff if I'm trying to build my own array of lines to feed into difflib. — john, Jan 17 '18 at 17:52
Annnd looking at the `difflib` docs and doing some experiments, it seems that list of strings are required. Well, at least something with a `__len__`, which you might be able to monkey-patch, but if it assumes a `list` as per the docs, it might require other methods available on lists that wouldn't be so easy to duck-type onto a lazy iterable. — juanpa.arrivillaga, Jan 17 '18 at 18:07
Possible duplicate of [How to get the file diff between two S3 buckets?](https://stackoverflow.com/questions/45513538/how-to-get-the-file-diff-between-two-s3-buckets) — Kannaiyan, Jan 17 '18 at 23:13
You cannot take a diff of two s3 objects without downloading them. Reference: https://stackoverflow.com/questions/40138780/how-to-compare-versions-of-an-amazon-s3-object — Kannaiyan, Jan 17 '18 at 23:16

score 0 · Answer 1 · answered May 13 '18 at 21:34

For future readers, I'll answer the exact question "Is it possible to use readlines with boto3?"

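import io

import boto3

# Assumes AWS credentials are configured and that bucket and key are already defined.
s3_client = boto3.client('s3')

body = s3_client.get_object(Bucket=bucket, Key=key)['Body']
# _raw_stream is a private attribute of the StreamingBody and may change between botocore versions.
stream = io.BufferedReader(body._raw_stream)
lines = stream.readlines()  # list of bytes, one entry per line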
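
Note that readlines() on a binary stream returns bytes, while difflib.ndiff expects lists of strings, so the lines need decoding first. Here is a minimal sketch, assuming the files are UTF-8 text. The helper name diff_streams is hypothetical, and io.BytesIO stands in for the S3 stream so the snippet runs without AWS access:

```python
import difflib
import io


def diff_streams(stream_a, stream_b, encoding="utf-8"):
    """Return an ndiff generator over two binary file-like objects.

    Hypothetical helper: both streams are read fully into memory.
    """
    lines_a = [line.decode(encoding) for line in stream_a.readlines()]
    lines_b = [line.decode(encoding) for line in stream_b.readlines()]
    return difflib.ndiff(lines_a, lines_b)


# io.BytesIO is a stand-in for io.BufferedReader(body._raw_stream).
old = io.BytesIO(b"a,1\nb,2\n")
new = io.BytesIO(b"a,1\nb,3\n")
result = list(diff_streams(old, new))
print("".join(result), end="")
```

This still holds both files in memory as lists, so it avoids writing to disk but not the download itself.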

As the comments on the question point out, readlines() pulls everything into memory. To limit how much is read per call, you can pass it a hint: "no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint." (https://docs.python.org/2/library/io.html#io.IOBase.readlines)
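
As a rough sketch of how the hint can be used, the loop below reads a stream in batches of lines. The helper name iter_line_batches and the default hint size are assumptions for illustration, and io.BytesIO again stands in for the S3 stream:

```python
import io


def iter_line_batches(stream, hint=1024 * 1024):
    """Yield lists of lines, each list totalling roughly `hint` bytes.

    Hypothetical helper: works on any binary file-like object.
    """
    while True:
        batch = stream.readlines(hint)
        if not batch:
            # An empty list means the stream is exhausted.
            break
        yield batch


# io.BytesIO is a stand-in for io.BufferedReader(body._raw_stream).
data = io.BytesIO(b"one\ntwo\nthree\nfour\n")
batches = list(iter_line_batches(data, hint=5))
all_lines = [line for batch in batches for line in batch]
```

Batching like this bounds the memory used while reading, but as noted in the comments on the question, difflib.ndiff still needs both complete sequences, so it does not by itself make a diff of two 5GB files feasible.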
