I'm attempting to write a Python 2/3 compatible routine to fetch a CSV file, decode it from latin_1 into Unicode and feed it to a csv.DictReader in a robust, scalable manner.
- For Python 2/3 support, I'm using python-future, including importing open from builtins, and importing unicode_literals for consistent behaviour
- I'm hoping to handle exceptionally large files by spilling to disk, using tempfile.SpooledTemporaryFile
- I'm using io.TextIOWrapper to handle decoding from the latin_1 encoding before feeding to DictReader
This all works fine under Python 3.
The problem is that TextIOWrapper expects to wrap a stream which conforms to BufferedIOBase. Unfortunately under Python 2, although I have imported the Python 3-style open, the vanilla Python 2 tempfile.SpooledTemporaryFile still of course returns a Python 2 cStringIO.StringO, instead of a Python 3 io.BytesIO as required by TextIOWrapper.
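To make the working case concrete, here is a minimal sketch of the same decode path without the HTTP request, assuming Python 3 and a small in-memory payload. The function name read_latin1_csv and the sample bytes are made up for illustration, and _file is a private attribute, so this is only a reproduction of what I'm doing, not a recommended pattern.

```python
import csv
import tempfile
from io import TextIOWrapper


def read_latin1_csv(payload, spool_size=1024):
    """Spool raw bytes, then decode them as latin_1 and parse as CSV."""
    raw_file = tempfile.SpooledTemporaryFile(max_size=spool_size)
    raw_file.write(payload)
    raw_file.seek(0)
    # On Python 3 the private _file attribute is an io.BytesIO until the
    # spool rolls over to disk, so TextIOWrapper accepts it.
    text_file = TextIOWrapper(raw_file._file, encoding='latin_1', newline='')
    # Read everything before the temporary file goes out of scope.
    return list(csv.DictReader(text_file))


# 0xE9 is "e acute" in latin_1.
rows = read_latin1_csv(b'name,qty\r\ncaf\xe9,2\r\n')
```

On Python 2 the same call is where the AttributeError shown at the end of this question is raised.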
I can think of these possible approaches:
- Wrap the Python 2 cStringIO.StringO as a Python 3-style io.BytesIO. I'm not sure how to approach this - would I need to write such a wrapper or does one already exist?
- Find a Python 2 alternative to wrap a cStringIO.StringO stream for decoding. I haven't found one yet.
- Do away with SpooledTemporaryFile, decode entirely in memory. How big would the CSV file need to be for operating entirely in memory to become a concern?
- Do away with SpooledTemporaryFile, and implement my own spill-to-disk. This would allow me to call open from python-future, but I'd rather not as it would be very tedious and probably less secure.
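For the first option, this is roughly the kind of wrapper I have in mind: a thin io.RawIOBase subclass that only needs read() from the underlying object, placed under io.BufferedReader so that TextIOWrapper gets the interface it expects. The names ReadOnlyRawAdapter, LegacyStream and decode_csv are hypothetical, LegacyStream is just a stand-in for a stream without readable(), and I have only reasoned about this on Python 3, so treat it as a sketch rather than a confirmed fix.

```python
import csv
import io


class ReadOnlyRawAdapter(io.RawIOBase):
    """Hypothetical adapter exposing a read()-only object as a raw stream."""

    def __init__(self, fileobj):
        super(ReadOnlyRawAdapter, self).__init__()
        self._fileobj = fileobj

    def readable(self):
        return True

    def readinto(self, buf):
        # Copy at most len(buf) bytes into the caller's buffer;
        # returning 0 signals end of stream.
        data = self._fileobj.read(len(buf))
        buf[:len(data)] = data
        return len(data)


class LegacyStream(object):
    """Stand-in for a stream that offers read() but not readable()."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, size=-1):
        return self._inner.read(size)


def decode_csv(fileobj, encoding='latin_1'):
    buffered = io.BufferedReader(ReadOnlyRawAdapter(fileobj))
    text_file = io.TextIOWrapper(buffered, encoding=encoding, newline='')
    return list(csv.DictReader(text_file))


rows = decode_csv(LegacyStream(b'name,qty\r\ncaf\xe9,2\r\n'))
```

If something like this is sound, the adapter could wrap the SpooledTemporaryFile itself rather than its private _file attribute, since only read() is required.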
What's the best way forward? Have I missed anything?
Imports:
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
from builtins import (ascii, bytes, chr, dict, filter, hex, input,  # noqa
                      int, map, next, oct, open, pow, range, round,  # noqa
                      str, super, zip)  # noqa
import csv
import tempfile
from io import TextIOWrapper
import requests
Init:
...
self._session = requests.Session()
...
Routine:
def _fetch_csv(self, path):
    # Buffer the download in memory, spilling to disk past spool_size.
    raw_file = tempfile.SpooledTemporaryFile(
        max_size=self._config.get('spool_size')
    )
    csv_r = self._session.get(self.url + path)
    for chunk in csv_r.iter_content():
        raw_file.write(chunk)
    raw_file.seek(0)
    # Decode the raw bytes as text; this is the line that fails on Python 2.
    text_file = TextIOWrapper(raw_file._file, encoding='latin_1')
    return csv.DictReader(text_file)
Error:
...in _fetch_csv
    text_file = TextIOWrapper(raw_file._file, encoding='utf-8')
AttributeError: 'cStringIO.StringO' object has no attribute 'readable'