Here's the issue I'm running into:

Error: iterator should return strings, not bytes (did you open the file in text mode?)

The code that's causing this looks something like:

import csv
import tarfile

with tarfile.open(filename) as t:
    for fileinfo in t:
        f = t.extractfile(fileinfo)  # binary file object; yields bytes
        reader = csv.DictReader(f)
        reader.fieldnames  # raises the error above

The trouble seems to be that the extractfile() method returns an io.BufferedReader, a binary file-like object that yields bytes and has no text interface, whereas csv.DictReader expects an iterator that yields strings.
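
A minimal way to reproduce this without a file on disk might look like the sketch below. The helper `make_tar` and the member name `data.csv` are made up for illustration and are not part of the real codebase. The exact wording of the error message can differ between Python versions.

```python
import csv
import io
import tarfile


def make_tar(payload: bytes, name: str = "data.csv") -> io.BytesIO:
    """Build an in-memory tar archive holding one member (hypothetical name)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as t:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        t.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


archive = make_tar(b"id,name\n1,alpha\n")
with tarfile.open(fileobj=archive) as t:
    member = t.getmembers()[0]
    f = t.extractfile(member)
    first_line = f.readline()  # a bytes object, not str
    f.seek(0)
    try:
        csv.DictReader(f).fieldnames
        error = None
    except csv.Error as exc:
        error = str(exc)  # the "strings, not bytes" complaint
```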

What would be a good way to handle this?

I'm considering decoding the bytes from the reader into text, but I need to keep streaming because these files are very large. The codebase is Python 3.6 running on Docker/Linux.

Aran-Fey
Neil C. Obremski
    I'm too lazy to tar a csv file and post a complete and tested solution, but you should take a look at [`io.TextIOWrapper`](https://docs.python.org/3/library/io.html#io.TextIOWrapper). – Aran-Fey Oct 02 '18 at 21:08
    Can't you just wrap it as a text stream using the [`codecs`](https://docs.python.org/3/library/codecs.html) module? Something like `codecs.getreader("utf-8")(t.extractfile(fileinfo))`? – zwer Oct 02 '18 at 21:12

1 Answer

Thanks to both @Aran-Fey and @zwer, who led me to another Stack Overflow question that answered this. Here's how:

import codecs
import csv
import tarfile

with tarfile.open(filename) as t:
    for fileinfo in t:
        with t.extractfile(fileinfo) as f:
            ft = codecs.getreader("utf-8")(f)  # decodes bytes to str as it reads
            reader = csv.DictReader(ft)
            reader.fieldnames

This seems to work so far.
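
For comparison, the io.TextIOWrapper route that @Aran-Fey suggested might look like the sketch below. The helper name `iter_csv_rows` and the in-memory archive with a `data.csv` member are made up for illustration, and UTF-8 is an assumption about the data. Passing `newline=""` follows the csv module's recommendation for text files, so quoted fields that contain line breaks are left for the csv parser to handle. The `isfile()` check is there because extractfile() returns None for members that are not regular files or links, such as directories.

```python
import csv
import io
import tarfile


def iter_csv_rows(t, encoding="utf-8"):
    """Yield (member_name, row_dict) for every CSV row in an open TarFile."""
    for fileinfo in t:
        if not fileinfo.isfile():
            continue  # skip directories and other non-regular members
        # Wrap the binary stream so it decodes to str lazily while reading.
        with io.TextIOWrapper(t.extractfile(fileinfo),
                              encoding=encoding, newline="") as text:
            for row in csv.DictReader(text):
                yield fileinfo.name, row


# Demo on a small in-memory archive (hypothetical contents).
payload = 'id,note\r\n1,"line one\nline two"\r\n'.encode("utf-8")
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as t:
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf) as t:
    rows = list(iter_csv_rows(t))
```

With a real archive, the same generator could be driven by `with tarfile.open(filename) as t:` and consumed row by row, so only a small buffer is held in memory at a time.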

Neil C. Obremski