I have a non-text file containing a mix of binary bytes and text. How do I cleanly separate the text from the rest?

Question

The file is fairly large (around 3 MB), so this can't be done manually. The text probably amounts to more than a thousand lines scattered throughout the file, and it includes line breaks, so it is properly formatted. Nothing marks where a binary section ends and a text section begins (the text is stored as bytes too; this isn't a .txt file): a chunk of text is simply surrounded by binary data, followed by another chunk of text. Deleting all non-ASCII characters in Notepad++ removes a good portion of the binary data, but a lot of other junk is still left over.

Preferred language is Python.

score 0 · Answer 1 · answered Feb 28 '21 at 15:58

Open the file with the encoding that seems to match its contents (probably UTF-8) and ignore all decoding errors:

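# errors="ignore" silently drops any bytes that are not valid UTF-8
with open("my_file", encoding="utf8", errors="ignore") as f:
    for i, line in enumerate(f, 1):
        # do something with line; printing it here is only a placeholder
        print(i, line, end="")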

See "UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?" for more information.
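Note that errors="ignore" only discards bytes that fail to decode. Bytes such as NUL and other control characters are valid UTF-8, so they will still show up in the output. If that leftover junk is a problem, a different approach (similar in spirit to the Unix strings tool, not the method above) is to read the file as bytes and keep only long enough runs of printable characters. This is a minimal sketch: the function name extract_text_runs, the min_len threshold of 4, and the assumption that the text is plain ASCII are all illustrative choices, not something known about your file.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII (plus tab, CR, LF) found in data.

    Assumption: the text is plain ASCII, and any run shorter than
    min_len bytes is binary noise that merely looks printable.
    """
    # \x20-\x7e covers space through tilde; tab and line breaks are
    # allowed so that multi-line text stays together in one run.
    pattern = re.compile(rb"[\x20-\x7e\t\r\n]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

To use it, read the whole file in binary mode (3 MB fits in memory easily), for example data = open("my_file", "rb").read(), and then join or write out the result of extract_text_runs(data). Raising min_len removes more false positives but may also drop short real lines, so it is worth trying a few values.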

