
Firstly, I understand how to write UTF-8 from strings in Python 3 and that StringIO is recommended for that kind of string building. However, I specifically need a binary file-like object, and for that I need BytesIO. If I do the following, the load blows up because the data gets read as Latin-1, my computer's default locale/charset.

import csv
import io

with io.StringIO() as sb:
    csv.writer(sb).writerows(rows)
    sb.flush()
    sb.seek(0)
    # blows up with a Latin-1 encoding error
    job = bq.load_table_from_file(sb, table_ref, job_config=job_config)

So my work-around is this monstrosity that doubles the amount of memory used:

with io.StringIO() as sb:
    csv.writer(sb).writerows(rows)
    sb.flush()
    sb.seek(0)
    with io.BytesIO(sb.getvalue().encode('utf-8')) as buffer:
        job = bq.load_table_from_file(buffer, table_ref, job_config=job_config)

Somewhere in this chain there must be a way to specify the byte encoding so that readers of the file-like object sb will see the data as UTF-8. Or is there a way to use csv.writer() with a byte stream?

I've looked for answers to both of these questions on Stack Overflow, but what I've found generally covers writing to files on disk; for anything in memory, everything points to StringIO.

Neil C. Obremski
  • There must be a way to create the job directly from the `rows`. Otherwise the whole thing would be inefficient not only in terms of memory usage but also in terms of CPU usage, since encoding characters into bytes and back again is quite expensive. – aventurin Nov 27 '19 at 23:09
  • Well, creating a job requires contacting the API service which means the `rows` must be serialized into a transport format such as JSON or CSV. – Neil C. Obremski Dec 01 '19 at 16:21

1 Answer


There is a TextIOWrapper class which does the job, but if you use it as a context manager it will close the underlying stream on exit, making the original BytesIO object unusable.
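To make that pitfall concrete, here is a minimal sketch using only the standard library; exiting the with block closes the wrapper, which closes the wrapped buffer as well:

import io

buffer = io.BytesIO()
with io.TextIOWrapper(buffer, encoding='utf-8') as text:
    text.write('hello')

print(buffer.closed)  # True - closing the wrapper closed the buffer too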

Modifying my original example:

with io.BytesIO() as buffer:
    # text layer so csv.writer can write str; newline='' stops newline translation
    sb = io.TextIOWrapper(buffer, encoding='utf-8', newline='')
    csv.writer(sb).writerows(rows)
    sb.flush()
    buffer.seek(0)
    job = bq.load_table_from_file(buffer, table_ref, job_config=job_config)

Another caveat is the newline parameter: left at its default, the wrapper translates newline characters as it writes. Set newline='' to prevent this.
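To see what that translation does to CSV output, here is a small sketch; it passes newline='\r\n' to deterministically force the same rewriting that the default newline=None performs on Windows. csv.writer terminates rows with '\r\n', and the text layer then rewrites the '\n', corrupting the terminator:

import csv
import io

buffer = io.BytesIO()
# newline='\r\n' rewrites every '\n' on write, as newline=None does on Windows
text = io.TextIOWrapper(buffer, encoding='utf-8', newline='\r\n')
csv.writer(text).writerow(['a', 'b'])
text.flush()

print(buffer.getvalue())  # b'a,b\r\r\n' - mangled row terminator

With newline='' the buffer holds b'a,b\r\n', exactly what the csv module wrote.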

Neil C. Obremski
  • I don't think it's the context manager that's closing the stream, it's the csv writer when it goes out of scope - see https://stackoverflow.com/a/48434568 – David Waterworth Mar 23 '23 at 01:31
  • The `TextIOWrapper` actually closes the stream when _it_ goes out of scope (the `csv.writer` does not close the stream). There is a `detach()` method to avoid this behavior (see the sketch below) - an unfortunate design that requires saving references to TextIOWrapper even when all you need is a temporary pass-through of string to bytes – Neil C. Obremski Mar 26 '23 at 16:14
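Building on that detach() point, here is a minimal sketch of a helper (the name rows_to_utf8_csv is made up for illustration) that serializes rows to an in-memory UTF-8 CSV and hands back the raw BytesIO, detached so the wrapper's finalizer can no longer close it; bq, table_ref, and job_config are the same objects as in the question:

import csv
import io

def rows_to_utf8_csv(rows):
    """Serialize rows to UTF-8 CSV in memory and return the raw BytesIO."""
    buffer = io.BytesIO()
    text = io.TextIOWrapper(buffer, encoding='utf-8', newline='')
    csv.writer(text).writerows(rows)
    text.flush()
    text.detach()  # unhook the wrapper so garbage collection cannot close buffer
    buffer.seek(0)
    return buffer

with rows_to_utf8_csv(rows) as buffer:
    job = bq.load_table_from_file(buffer, table_ref, job_config=job_config)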