I'm trying to run a diff on two files that are stored in S3, and would like to avoid downloading the files if possible.

The sample code I am working with is as follows:

import difflib

# Open both files and make sure they are closed again afterwards
with open('sample1.csv', 'r') as file1, open('sample2.csv', 'r') as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())

I see that with the boto3 package I can open the file from S3, but how can I pass the equivalent of file1.readlines() and file2.readlines() into the ndiff function?

john
  • Get the string and split on line breaks? – juanpa.arrivillaga Jan 17 '18 at 17:44
  • Won't that involve storing in memory? My concern is when the file is 5GB, I won't be able to run a diff if I'm trying to build my own array of lines to feed into difflib. – john Jan 17 '18 at 17:52
  • yes, but so does `.readlines` – juanpa.arrivillaga Jan 17 '18 at 18:00
  • Annnd looking at the `difflib` docs and doing some experiments, it seems that lists of strings are required. Well, at least something with a `__len__`, which you might be able to monkey-patch, but if it assumes a `list` as per the docs, it might require other methods available on lists that wouldn't be so easy to duck-type onto a lazy iterable. – juanpa.arrivillaga Jan 17 '18 at 18:07
  • Possible duplicate of [How to get the file diff between two S3 buckets?](https://stackoverflow.com/questions/45513538/how-to-get-the-file-diff-between-two-s3-buckets) – Kannaiyan Jan 17 '18 at 23:13
  • You cannot take a diff of two s3 objects without downloading them. Reference: https://stackoverflow.com/questions/40138780/how-to-compare-versions-of-an-amazon-s3-object – Kannaiyan Jan 17 '18 at 23:16
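
The point about `__len__` in the comments can be checked with a small sketch (the sample lines are made up for illustration): `difflib.ndiff` accepts lists, but a lazy iterator fails as soon as the diff is consumed, because `difflib` calls `len()` on its inputs.

```python
import difflib

a = ["a,1\n", "b,2\n"]
b = ["a,1\n", "b,3\n"]

# Lists work: ndiff returns a generator, which is consumed here.
from_lists = list(difflib.ndiff(a, b))

# Lazy iterators have no __len__, so consuming the diff raises TypeError.
try:
    list(difflib.ndiff(iter(a), iter(b)))
    lazy_ok = True
except TypeError:
    lazy_ok = False

print(lazy_ok)
```

So any lazy source of lines, an S3 stream included, has to be turned into a list (or something list-like) before it reaches `ndiff`.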

1 Answer

For future readers, I'll answer the exact question "Is it possible to use readlines with boto3?"

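import io

import boto3

# Set up the S3 client; bucket and key must name the object to read
s3_client = boto3.client('s3')

body = s3_client.get_object(Bucket=bucket, Key=key)['Body']
# Wrap the underlying raw stream (a private botocore attribute) for buffered reads
stream = io.BufferedReader(body._raw_stream)
stream.readlines()  # returns a list of bytes lines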

As the comments on the question point out, readlines() pulls everything into memory. To limit how much is read in one call, you can pass it a hint, so that "no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint." (https://docs.python.org/2/library/io.html#io.IOBase.readlines)
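
A minimal sketch of how this fits together, assuming the S3 body behaves like any other binary file object. Here `io.BytesIO` stands in for the S3 stream so the example runs without AWS access, and the names `sample1`, `sample2` and `read_text_lines` are made up for illustration. It shows how the hint caps a single `readlines()` call, and how the lines can be decoded and handed to `difflib.ndiff`, which still needs both complete lists in memory.

```python
import difflib
import io

# Stand-ins for the contents of the two S3 objects (hypothetical sample data).
sample1 = b"a,1\nb,2\nc,3\n"
sample2 = b"a,1\nb,9\nc,3\n"

# With a hint, readlines() stops once the lines read so far exceed the hint.
stream = io.BufferedReader(io.BytesIO(sample1))
first_batch = stream.readlines(5)   # two 4-byte lines: 4 <= 5, then 8 > 5
second_batch = stream.readlines(5)  # whatever is left in the stream


def read_text_lines(binary_stream, encoding="utf-8"):
    """Read every line from a binary stream and decode it to str."""
    return [line.decode(encoding) for line in binary_stream.readlines()]


lines1 = read_text_lines(io.BufferedReader(io.BytesIO(sample1)))
lines2 = read_text_lines(io.BufferedReader(io.BytesIO(sample2)))

# ndiff yields its output lazily, but both inputs must be complete lists.
diff = list(difflib.ndiff(lines1, lines2))
print("".join(diff), end="")
```

With boto3, the `io.BufferedReader(io.BytesIO(...))` stand-ins would be replaced by the wrapped stream from `get_object`. The hint only bounds each individual read; a diff of two 5GB objects would still hold both sets of lines in memory.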

Zachary Ryan Smith