I am trying to read this file, which has some strange characters in it. Opening the file in Notepad++ shows them replaced by the "SUB" character.
The contents of the file are:
>>> open('test.txt', 'rb').read()
b'the first line\r\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\r\nthe third line\r\neverything\r\nafter\r\nthe\r\nfourth\r\nline'
I am reading it in Python with this simple code:
with open('test.txt') as f:
    for line in f:
        print line
which results in the program completely ignoring everything after the first SUB character. It prints neither the third line nor any line after it.
My question now is two-fold:
- What exactly are the unknown characters in the file?
- What is the best way to read the file with these strange characters?
EDIT:
As far as I understand, the problem comes from the character \x1a, which is, according to this question, the "end of file character". That explains why Python simply stops reading the file when it encounters one, and means that my question is now:
How can I, using Python, read a file that contains the control character U+001A (SUB) in the middle without Python interpreting it as end of file?
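For reference, here is a minimal sketch of one approach I would expect to work, assuming the truncation comes from text mode on Windows treating 0x1A as an end-of-file marker: open the file in binary mode ('rb'), the same mode that produced the full dump above, and split the lines manually. The helper name read_lines_binary and the shortened SAMPLE bytes are made up for illustration and are not part of my real script.

```python
import io
import os
import tempfile

# Shortened stand-in for test.txt (assumption: same structure, fewer bytes).
SAMPLE = b'the first line\r\nsomething \x06d \x1a Rd\r\nthe third line'


def read_lines_binary(path):
    """Return the file's lines as byte strings, reading in binary mode.

    In binary mode no text-mode translation is applied, so 0x1A is
    treated as an ordinary byte rather than as an end-of-file marker.
    """
    with io.open(path, 'rb') as f:
        data = f.read()
    # For byte strings, splitlines() breaks on \n, \r and \r\n only.
    return data.splitlines()


# Write the sample to a temporary file and read it back.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, 'wb') as f:
        f.write(SAMPLE)
    lines = read_lines_binary(path)
finally:
    os.remove(path)

for line in lines:
    print(repr(line))
```

The lines come back as raw bytes, so any decoding (and any decision about what to do with \x06 and \x1a) would still have to happen afterwards.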