string.find() in python cannot handle special characters

Question

I think the error is in the read function: it cannot read beyond the special character shown in the image. See the repr output below.

I am using string.find() in Python as follows:

indexOfClosedDoc = temp.find("</DOC>",indexOfOpenDoc)

However, when the string has text as below:

SUB
</DOC>

where SUB is a special character, temp.find cannot find the tag. Any suggestions on how to fix this?

Example:

enter image description here

Code that causes it to fail:

handle = open("error.txt", 'r')
temp = handle.read()
handle.close()
index = temp.find("</DOC>", 0)
if index == -1:
    print "Error"
    exit(1)

Put the image text in a text file and run the code.

Here is the repr of the temp variable for the text in the example. The text in error.txt is everything from line 29722 in the image:

' </P>\n\n'

NOTE: The read() function never reads beyond SUB, so finding the tag is out of the question.

Please give an example of data that causes it to fail. What "special characters" cause this? — BrenBarn, Aug 23 '12 at 03:49
why -60 because the image is not the whole document I am parsing — Programmer, Aug 23 '12 at 03:59
Could you post a code snippet that causes it to fail? Something that could actually be run... — Jeff Tratner, Aug 23 '12 at 04:00
is the actual string "\x1a\n</DOC>" or are there other characters in it? Because "\x1a\n</DOC>".find("</DOC>") returns 2 as expected. — Dmitry B., Aug 23 '12 at 04:06
@Dmitry: The text file is exactly as the image. If you put the image text in a text file called "error.txt" and run the provided python code, you should see the error — Programmer, Aug 23 '12 at 04:09
the image won't show us any unprintable characters that might be in the data throwing off find(). — Dmitry B., Aug 23 '12 at 04:15
I think the problem is with the read() function because it seems it cannot read beyond the 'SUB' — Programmer, Aug 23 '12 at 04:16
What platform and version of Python are you using? Can you please post a repr() of the temp variable? — nneonneo, Aug 23 '12 at 04:23
Still can't reproduce. How big is the file on disk? Does it correspond to the number of bytes you think should be in the file (i.e. is your text editor correctly saving the file?) Again, what platform and Python version are you using? — nneonneo, Aug 23 '12 at 04:29
Python version 2.6.6 and Windows. What do you mean by "can't reproduce"? — Programmer, Aug 23 '12 at 04:30
try opening the file with 'rb' binary mode rather than text mode. From the docs - "On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files" — Tim Hoffman, Aug 23 '12 at 04:33
Duplicate of http://stackoverflow.com/questions/405058/line-reading-chokes-on-0x1a — nneonneo, Aug 23 '12 at 04:33

score 2 · Accepted Answer · edited May 23 '17 at 12:20

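The answer is to open the file using 'rb' mode. On Windows, opening the file with just 'r' uses text mode, which keeps the old DOS behaviour of treating 0x1A (Ctrl-Z, the DOS EOF character) as the end of the file, so read() stops there. See also "Line reading chokes on 0x1A" (http://stackoverflow.com/questions/405058/line-reading-chokes-on-0x1a).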
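
A minimal sketch of the binary-mode fix, assuming a small stand-in for error.txt (the `sample` bytes and the temporary file are hypothetical, not the asker's data). It also compares the size on disk with the number of bytes read, which is one way to confirm whether a read stopped early. Bytes literals are used so that the same lines run on Python 2.6+ and Python 3.

```python
import os
import tempfile

# Hypothetical stand-in for error.txt: a SUB byte (0x1A) precedes the tag.
sample = b" </P>\n\n\x1a\n</DOC>\n"

# Write the bytes to a temporary file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample)

size_on_disk = os.path.getsize(path)

# 'rb' turns off text-mode translation, so 0x1A is read as ordinary data.
with open(path, "rb") as handle:
    temp = handle.read()
os.remove(path)

# If len(temp) were smaller than size_on_disk, the read stopped early.
print(size_on_disk, len(temp))

# Search with a bytes pattern, since the file was read as bytes.
index = temp.find(b"</DOC>")
print(index)
```

If the data later needs line-ending normalisation, that would have to be done explicitly after reading, since binary mode leaves "\r\n" untouched.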

answered Aug 23 '12 at 04:34

nneonneo


Thanks for the answer, but why did I get -2? I think it was a good question. – Programmer Aug 23 '12 at 05:43

score 0 · Answer 2 · answered Aug 23 '12 at 04:41

Note: if the file uses a multibyte encoding, then .find() on the undecoded contents won't work even if there is no 0x1A in it, e.g.:

import codecs

with codecs.open('file.utf16', 'w', encoding='utf-16') as f:
    f.write(u"abcd")  # write a string using utf-16 encoding

# XXX incorrect code, don't use it: text mode still returns undecoded bytes
with open('file.utf16', 'r') as f:
    temp = f.read()
    i = temp.find('bc')
    print i  # XXX -> -1 not found

# XXX also incorrect: binary mode returns the raw utf-16 bytes
with open('file.utf16', 'rb') as f:
    temp = f.read()
    i = temp.find('bc')
    print i  # XXX -> -1 not found

# works: codecs.open decodes the bytes to unicode before the search
with codecs.open('file.utf16', encoding='utf-16') as f:
    temp = f.read()
    i = temp.find('bc')
    print i  # -> 1 found
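
To see why the byte-level search misses, here is a small sketch that looks at the encoded bytes directly. It assumes an explicit "utf-16-le" byte order with no BOM (an assumption made so the result does not depend on the platform), and the variable names are illustrative only.

```python
text = u"abcd"

# Each character becomes two bytes, so 'b' and 'c' are no longer adjacent.
raw = text.encode("utf-16-le")

# A plain byte pattern cannot match across the interleaved zero bytes.
plain_hit = raw.find(b"bc")

# Encoding the pattern the same way matches, but yields a byte offset.
needle = u"bc".encode("utf-16-le")
byte_offset = raw.find(needle)

# Decoding first yields a character index, as in the working version above.
char_index = raw.decode("utf-16-le").find(u"bc")

print(plain_hit, byte_offset, char_index)
```

Decoding before searching is usually the safer choice: an encoded pattern can in principle match at an odd byte offset that straddles two characters, and the offset it returns still has to be converted back to a character position.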

score -1 · Answer 3 · answered Aug 23 '12 at 04:19

Check your indexOfOpenDoc value; I suspect it is larger than the position where </DOC> appears, in which case find() starts searching after the tag.

answered Aug 23 '12 at 04:19

Wei


No. The problem is in the read function, as it cannot read beyond SUB – Programmer Aug 23 '12 at 04:23
