The file is fairly large (around 3 MB), so this can't be done by hand. The text probably amounts to more than a thousand lines scattered throughout the file, and it contains line breaks, so it is properly formatted. Nothing marks where a binary section ends and a text section begins (the text is stored as bytes too; this isn't a .txt file): a chunk of text is simply surrounded by binary data, and then another chunk of text follows. Deleting all non-ASCII characters in Notepad++ removes a good portion of the binary data, but a lot of other junk is still left behind.

Preferred language is Python.

Hormoz

1 Answer

Open the file with the encoding that seems to match its contents (probably UTF-8) and ignore all decoding errors:

with open("my_file", encoding="utf8", errors="ignore") as f:
    for i, line in enumerate(f, 1):
        # do something with line; bytes that failed to decode were dropped
        print(i, line, end="")

See "UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?" for more information.
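Note that errors="ignore" only drops bytes that fail to decode. Bytes such as NUL are valid UTF-8 and will still show up in the output. If that leaves too much junk, a different approach is to read the file in binary mode and keep only runs of printable ASCII, similar in spirit to the Unix strings tool. The following is a minimal sketch, not a definitive solution: the function name extract_text_runs, the min_len threshold of 4, and the sample bytes are all assumptions made for illustration, and it assumes the embedded text is plain ASCII.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII (plus tab, CR, LF) of at least min_len bytes.

    Hypothetical helper: assumes the embedded text is plain ASCII.
    """
    # \x20-\x7e covers space through tilde; \t, \n, \r keep line breaks intact.
    pattern = re.compile(rb"[\t\n\r\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Made-up sample: two text chunks surrounded by binary junk.
sample = b"\x00\x01Hello, world\n second line\xff\xfe\x00ab\x80Another chunk\x00"
chunks = extract_text_runs(sample)
for chunk in chunks:
    print(repr(chunk))
```

For a real file, the bytes would come from open("my_file", "rb").read(), which is fine for a file of about 3 MB. Raising min_len filters out more short accidental matches at the cost of possibly dropping very short pieces of real text.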

Piotr Dobrogost