I have a BytesIO buffer that holds the contents of a file downloaded from S3. The file contains BSON objects (serialized with pybson) separated by \n characters, i.e. binary documents separated by newlines.

I want to parse the event objects in the file. I am iterating through each line like this:

from io import BytesIO
import bson  # pybson

def iter_event(data: BytesIO):
    for line in data:  # splits on every b"\n" byte
        yield bson.loads(line)

In some cases the line variable seems to end with rogue or corrupted bytes, and my code fails with the same exception that is mentioned briefly in one of the comments in this SO question. When I look at the file in a binary editor I cannot see the rogue bytes; they only appear in the line variable. (For what it's worth, the end of the BSON object looks like \x00\x00\n in a binary editor, while my line variable ends in \x00\x10e\n.)

Is there an issue with iterating through each line like this? If not, what's a better approach please?

John
  • Do you have a short but complete program to share? Maybe the bug is somewhere else, not in BytesIO. – pts Mar 14 '23 at 20:00
  • For debugging, you may want to `print(type(data))`, `print(type(line))` (it should be `bytes`, not `str`) and `print(len(line))`. Please post the debug output. – pts Mar 14 '23 at 20:03
  • "BSON objects separated by `\n` characters" is a completely unworkable data format - most of the BSON data types can contain a newline character in their representation. But you don't actually need a separator, as BSON data starts with a length, which could be used to unambiguously split up a stream into the original objects. I'm not familiar with the library you're using, so I can't suggest how you could best implement that. – jasonharper Mar 14 '23 at 20:21
  • BSON is a binary format. So append a BSON object to the file with a header which includes the size of the body BSON bytes. Then parse the file sequentially. It's easy with the standard `struct` module. – relent95 Mar 15 '23 at 01:37
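
The length-prefix approach suggested in the comments could be sketched roughly as follows. This is a minimal sketch, not a tested fix for the original file: it assumes every document starts with the standard BSON little-endian int32 total length and that the writer put exactly one `\n` after each document. The function name `iter_bson_documents` and the hand-built sample documents are hypothetical. It yields raw bytes, so each result can be passed to `bson.loads` as in the question.

```python
import struct
from io import BytesIO
from typing import Iterator


def iter_bson_documents(data: BytesIO) -> Iterator[bytes]:
    """Yield the raw bytes of each BSON document, split by length prefix."""
    while True:
        header = data.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated BSON length header")
        # A BSON document starts with its total size (int32, little-endian),
        # and that size includes the 4 header bytes themselves.
        (length,) = struct.unpack("<i", header)
        if length < 5:
            raise ValueError(f"invalid BSON document length: {length}")
        body = data.read(length - 4)
        if len(body) != length - 4:
            raise ValueError("truncated BSON document")
        yield header + body
        # Assumption: the writer appended a single b"\n" after each document.
        separator = data.read(1)
        if separator not in (b"", b"\n"):
            raise ValueError(f"unexpected separator byte: {separator!r}")


# Hand-built sample: {"a": 10} as BSON; the int32 value 10 is the byte 0x0A.
doc1 = struct.pack("<i", 12) + b"\x10a\x00" + struct.pack("<i", 10) + b"\x00"
doc2 = b"\x05\x00\x00\x00\x00"  # empty BSON document
payload = doc1 + b"\n" + doc2 + b"\n"

documents = list(iter_bson_documents(BytesIO(payload)))
lines = list(BytesIO(payload))  # line iteration cuts doc1 at its 0x0A byte
```

Here `documents` holds the two original documents, whereas `lines` holds three fragments because iterating a BytesIO splits on every `0x0A` byte, including those inside a document. That would be consistent with a `line` that ends in the middle of an element, though the actual file would need to be checked to confirm it.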

0 Answers