I am trying to read a file that has some strange characters in it. Opening the file in Notepad++ shows them as the "SUB" character.

The contents of the file are:

>>> open('test.txt', 'rb').read()
b'the first line\r\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\r\nthe third line\r\neverything\r\nafter\r\nthe\r\nfourth\r\nline'

I am reading it in Python with this simple code:

with open('test.txt') as f:
    for line in f:
        print line

which results in the program completely ignoring everything after the first SUB character. It prints neither the third line nor any line after it.

My question now is two-fold:

  1. What exactly are the unknown characters in the file?
  2. What is the best way to read the file with these strange characters?

EDIT:

As far as I understand, the problem comes from the character \x1a, which is, according to this question, the "end of file character". That explains why Python simply stops reading the file when it encounters one, and means that my question is now:

How can I, using Python, read a file that contains the control character U+001A in the middle without Python interpreting it as end of file?

5xum
  • Please put the file data **here**. A sample can be provided by using `print repr(f.read()[:100])`. Even then, guess-the-codec is not really a suitable game for Stack Overflow posts. And I most certainly won't be downloading an externally hosted random file where the site demands I enter an email address before I can access it. – Martijn Pieters Jan 07 '15 at 15:16
  • Did this file happen to come from a windows environment over to a linux environment? – user2097159 Jan 07 '15 at 15:17
  • @MartijnPieters f.read() stops reading the file after the unknown character. I cannot copy the file in any way in which it would be readable. – 5xum Jan 07 '15 at 15:18
  • @MartijnPieters I used one of the file upload sites suggested http://meta.stackexchange.com/questions/4637/please-add-a-system-to-allow-file-uploads-attached-to-questions-and-answers here. If you have a better suggestion about how I can convey the file, please tell me... – 5xum Jan 07 '15 at 15:20
  • @user2097159 No, the file was made and is read in windows. As far as I know, it's a UTF-8 file. – 5xum Jan 07 '15 at 15:20
  • As an alternative to filehosting.org, I suggest pastebin. – Kevin Jan 07 '15 at 15:21
  • @5xum: if this is Python 3, make sure you are opening the file in *binary* mode. `open(filename, 'rb')`. If it still cannot be read, then the filesystem is corrupt and the OS can not even give Python the data in it. – Martijn Pieters Jan 07 '15 at 15:23
  • @5xum: the hosted file is a) not accessible without jumping through dodgy hoops (what will they do with my email afterwards?), and b) your question needs to be useful for future visitors too, but the file hoster may or may not still be here in a year or 5 years time. – Martijn Pieters Jan 07 '15 at 15:25
  • @MartijnPieters I edited the file location, it is now accessible via dropbox with no additional hoops to jump through... As I already said, if I *could* copy the file contents, I *would*. Also, I think my question now fully constitutes a minimal, complete and verifiable example. – 5xum Jan 07 '15 at 15:26
  • @Kevin Thank you for the suggestion, but as I already said, pasting the file has proven impossible. I have now put the file in my dropbox folder, and it is available for direct download. – 5xum Jan 07 '15 at 15:29
  • No, the data is not UTF-8 encoded. – Martijn Pieters Jan 07 '15 at 15:30
  • I downloaded the file and I can't reproduce your problem. – khelwood Jan 07 '15 at 15:30
  • @khelwood On Windows or Linux? – 5xum Jan 07 '15 at 15:30
  • @MartijnPieters Thank you for letting me know. So the proper course of action now is to go murder whoever gave me this monster of a file? – 5xum Jan 07 '15 at 15:32
  • @5xum: unless they can explain why they have U+0006 and U+001A control codes in the file, that sounds like an appropriate retribution. – Martijn Pieters Jan 07 '15 at 15:35
  • @5xum On a Mac. I should have mentioned that. – khelwood Jan 07 '15 at 15:35
  • @MartijnPieters Do you know how the problem can be avoided? Can Python read files with U+001A in the middle of them and not stop after hitting them? – 5xum Jan 07 '15 at 15:39
  • Looks like `read` will give you the whole file, although if you want individual lines, you'll have to `split` on newlines yourself. – Kevin Jan 07 '15 at 15:41
  • @Kevin But `read` will only give me the whole file if I open the file in binary mode, meaning that huge other parts of my file containing non-ascii (but legit) characters will have to be re-parsed... – 5xum Jan 07 '15 at 15:46
  • 1A is the [SUB or *Soft EOF* character](http://en.wikipedia.org/wiki/Substitute_character) in ASCII. I *think* that *Windows* may indeed try and honour that. On Mac or Linux it won't terminate reading the file. And even then you can use `file.seek()` to get past the character and read the data anyway. – Martijn Pieters Jan 07 '15 at 16:01
  • Related, possible dupe: [cannot read ascii character 26?](http://stackoverflow.com/q/20786907) and [How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?](http://stackoverflow.com/q/20695336), [Reading lines beyond SUB in Python](http://stackoverflow.com/q/9520592) – Martijn Pieters Jan 07 '15 at 16:03

1 Answer

I'm on Windows. Interestingly, Python 3.3 reads the file fine in both binary and text mode, but text mode decodes to Unicode and probably reads the file in binary mode under the covers:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open('test.txt','rb').read()
b'the first line\r\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\r\nthe third line\r\neverything\r\nafter\r\nthe\r\nfourth\r\nline'
>>> open('test.txt','r').read()
'the first line\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\nthe third line\neverything\nafter\nthe\nfourth\nline'

On Python 2.7, however, it does stop at the \x1a:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open('test.txt','rb').read()
'the first line\r\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\r\nthe third line\r\neverything\r\nafter\r\nthe\r\nfourth\r\nline'
>>> open('test.txt','r').read()
'the first line\nsomething something \x06d '

On Python 2, the only other difference between text and binary mode is that \r\n is converted to \n, so if you still want that translation without stopping at \x1a, read the file in binary and do the replacement yourself:

>>> open('test.txt','rb').read().replace('\r\n','\n')
'the first line\nsomething something \x06d \x1a Rd<br>+ \x1a Rd;;\x06d \x1a Rd<br>+ \x1a\nthe third line\neverything\nafter\nthe\nfourth\nline'
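
If you want to keep the line-by-line loop from the question, the same idea can be sketched as a small generator. This is only an illustration, not a definitive implementation: the name `iter_lines_past_sub` is made up for this example, and the default `latin-1` encoding is an assumption (the file's real encoding is not known; `latin-1` is used only because it can decode any byte). Decoding each line separately is safe for single-byte encodings and for UTF-8, but not for encodings such as UTF-16.

```python
def iter_lines_past_sub(path, encoding="latin-1"):
    """Yield decoded lines from a file, treating \\x1a as ordinary data.

    Hypothetical helper: the name and the default encoding are
    assumptions made for this sketch, not part of any library.
    """
    # Binary mode avoids the text-mode handling that stops at \x1a
    # on Windows with Python 2.
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n" only.
        for raw_line in f:
            line = raw_line.decode(encoding)
            # Undo the \r\n -> \n translation by hand: drop one
            # trailing "\n" and then one trailing "\r", if present.
            if line.endswith("\n"):
                line = line[:-1]
            if line.endswith("\r"):
                line = line[:-1]
            yield line


if __name__ == "__main__":
    # 'test.txt' is the file name used in the question.
    for line in iter_lines_past_sub("test.txt"):
        print(repr(line))
```

Because it only uses binary reads and an explicit decode, this should behave the same on Python 2.7 and Python 3, and it does not load the whole file into memory.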
Mark Tolonen