I am using Python's boto library to interact with S3. The files I have on S3 are CSVs, and I'd like to read them line by line through a buffer so that memory usage stays bounded.
I was wondering whether anyone has a way of composing Python's io classes to achieve this. The goal is an abstraction that wraps a boto Key and provides a readline or iterator interface over it (the Key itself only provides a read(size=0) call). The complication is that, since the data is stored as CSV, each row has a variable length.
I wanted to wrap the boto Key in an object that implements the iterator protocol so that I could pass it to csv.reader; I ended up implementing that wrapper myself.
It looks like Python's io module really has all the pieces to do this (BufferedReader and TextIOWrapper), and I fooled around with it by naively trying to pass the boto Key to it, but BufferedReader expected an IOBase object.
I then implemented the IOBase protocol around the Key but got unicode errors, and just generally wasn't sure what I was doing.
Does anyone know whether Python's io module can do something like what's described above?
Technical specs:
There is a directory of 1-100 CSV files on S3. All have the same format, but a variable number of rows. I am trying to implement a function that takes an iterator of boto Keys. A Key provides a read(num_bytes) method.
def yield_lines(keys_iterator):
    # had to custom implement this
    # any way using io??
    # yield each CSV row across keys that only provide `read()` method
    raise NotImplementedError
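For reference, a hand-rolled version of this function might look roughly like the sketch below. It is not my actual implementation, just a minimal illustration of the idea. FakeKey is a hypothetical stand-in for a boto Key (it only exposes read), and the chunk_size and encoding parameters are assumptions added for the example.

```python
import csv


class FakeKey(object):
    """Hypothetical stand-in for a boto Key: exposes only read(size)."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size=0):
        # Assumption: size=0 means "read everything that is left",
        # and an empty bytes object signals end of data.
        end = len(self._data) if size <= 0 else self._pos + size
        chunk = self._data[self._pos:end]
        self._pos += len(chunk)
        return chunk


def yield_lines(keys_iterator, chunk_size=8192, encoding="utf-8"):
    """Yield decoded lines across all keys.

    Memory use is bounded by roughly one chunk plus one partial line.
    """
    for key in keys_iterator:
        pending = b""
        while True:
            chunk = key.read(chunk_size)
            if not chunk:
                break
            pending += chunk
            lines = pending.split(b"\n")
            # The last piece may be an incomplete line; keep it for later.
            pending = lines.pop()
            for line in lines:
                yield line.decode(encoding) + "\n"
        if pending:
            # Final line of this key had no trailing newline.
            yield pending.decode(encoding)


keys = [FakeKey(b"a,b\n1,2\n"), FakeKey(b"c,d\n3,4")]
rows = list(csv.reader(yield_lines(keys, chunk_size=3)))
```

Splitting on the newline byte before decoding avoids cutting a multi-byte UTF-8 character in half at a chunk boundary. Line terminators are kept so that csv.reader can still handle quoted fields that span lines.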
My initial attempt was to make the boto Key adhere to IOBase. I would compose it with a BufferedReader and then try to read lines from it using a TextIOWrapper, but I ran into encoding issues with readinto.
from io import BufferedReader, TextIOWrapper


class IOCompatibleKey(object):
    def __init__(self, s3_key):
        self.s3_key = s3_key

    def readable(self):
        return True

    def writable(self):  # the io protocol spells this `writable`
        return False

    def read(self, num_bytes):
        return self.s3_key.read(num_bytes)

    def readinto(self, n):
        # .... ?????
        raise NotImplementedError


buffered_reader = BufferedReader(IOCompatibleKey(s3_key))
text_reader = TextIOWrapper(buffered_reader)
for line in text_reader:  # <- IS THIS POSSIBLE????
    print(line)
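To make the intent concrete, here is a rough sketch of what I think the io-based composition would need to look like: subclass io.RawIOBase (rather than plain object) and implement readinto by copying bytes from the key into the buffer that BufferedReader supplies. ReadOnlyKey is a hypothetical stand-in for a boto Key, and KeyRawIO and lines_from_key are names invented for this example. Passing an explicit encoding to TextIOWrapper is my guess at avoiding the Unicode errors; I have not confirmed that this was their cause.

```python
import csv
import io


class ReadOnlyKey(object):
    """Hypothetical stand-in for a boto Key: exposes only read(size)."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)

    def read(self, size=0):
        return self._stream.read(size)


class KeyRawIO(io.RawIOBase):
    """Adapts an object with only read(size) to the raw binary I/O interface."""

    def __init__(self, s3_key):
        super().__init__()
        self.s3_key = s3_key

    def readable(self):
        return True

    def readinto(self, buffer):
        # `buffer` is a writable bytes-like object supplied by the caller.
        # Fill at most len(buffer) bytes and return how many were written;
        # returning 0 signals end of stream.
        if len(buffer) == 0:
            return 0
        data = self.s3_key.read(len(buffer))
        n = len(data)
        buffer[:n] = data
        return n


def lines_from_key(s3_key, encoding="utf-8", buffer_size=8192):
    """Return a text stream that can be iterated line by line."""
    buffered = io.BufferedReader(KeyRawIO(s3_key), buffer_size=buffer_size)
    # newline="" leaves line endings untouched, as the csv module recommends.
    return io.TextIOWrapper(buffered, encoding=encoding, newline="")


text_reader = lines_from_key(ReadOnlyKey("a,b\n1,é\n".encode("utf-8")))
rows = list(csv.reader(text_reader))
```

With this shape, BufferedReader does the chunked reads and TextIOWrapper handles decoding and line splitting, so the adapter itself only has to implement readable and readinto.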