-2

I think the error is in the read() function: it cannot read beyond the special character shown in the image. See the repr output below.

I am using string.find() in Python as follows:

indexOfClosedDoc = temp.find("</DOC>",indexOfOpenDoc)

However, when the string has text as below:

SUB
</DOC>

where SUB is a special character, temp.find() cannot find the tag. Any suggestions on how to fix this?

Example:

[Image: screenshot of the file contents]

Code that causes it to fail:

handle = open("error.txt", 'r')
temp = handle.read()
index = temp.find("</DOC>", 0)
if index == -1:
    print "Error"
    exit(1)

Put the image text in a text file and run the code.

Here is the repr of the temp variable for the text in the example. The text in error.txt is everything from line 29722 in the image:

' </P>\n\n'

NOTE: The read() function never reads beyond SUB, so find() has no chance of locating the tag.

Programmer
  • 5
    Please give an example of data that causes it to fail. What "special characters" cause this? – BrenBarn Aug 23 '12 at 03:49
  • What do you mean by *special character*, exactly? – behnam Aug 23 '12 at 03:50
  • Please look at attached image – Programmer Aug 23 '12 at 03:54
  • 5
    Could you show the result of `print(repr(temp[-60:]))`? – jfs Aug 23 '12 at 03:57
  • Why -60? The image is not the whole document I am parsing. – Programmer Aug 23 '12 at 03:59
  • 2
    Could you post a code snippet that causes it to fail? Something that could actually be run... – Jeff Tratner Aug 23 '12 at 04:00
  • 1
    Is the actual string "\x1a\n</DOC>", or are there other characters in it? Because "\x1a\n</DOC>".find("</DOC>") returns 2 as expected. – Dmitry B. Aug 23 '12 at 04:06
  • @Dmitry: The text file is exactly as the image. If you put the image text in a text file called "error.txt" and run the provided python code, you should see the error – Programmer Aug 23 '12 at 04:09
  • The image won't show us any unprintable characters that might be in the data and throwing off find(). – Dmitry B. Aug 23 '12 at 04:15
  • I think the problem is with the read() function because it seems it cannot read beyond the 'SUB' – Programmer Aug 23 '12 at 04:16
  • What platform and version of Python are you using? Can you please post a repr() of the temp variable? – nneonneo Aug 23 '12 at 04:23
  • @nneonneo: Done. Please read the note as well – Programmer Aug 23 '12 at 04:26
  • Still can't reproduce. How big is the file on disk? Does it correspond to the number of bytes you think should be in the file (i.e. is your text editor correctly saving the file?) Again, what platform and Python version are you using? – nneonneo Aug 23 '12 at 04:29
  • Python 2.6.6 on Windows. What do you mean by "can't reproduce"? – Programmer Aug 23 '12 at 04:30
  • Try opening the file in binary mode ('rb') rather than text mode. From the docs: "On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files" – Tim Hoffman Aug 23 '12 at 04:33
  • Duplicate of http://stackoverflow.com/questions/405058/line-reading-chokes-on-0x1a – nneonneo Aug 23 '12 at 04:33

3 Answers

2

The answer is to open the file using 'rb' mode. On Windows, opening the file with just 'r' will cause it to use the old DOS behaviour of stopping at 0x1A (a DOS EOF character). See also Line reading chokes on 0x1A
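A minimal sketch of the binary-mode approach, assuming the file contains a 0x1A byte before the closing tag. The helper name find_tag_binary and the sample contents are hypothetical and only serve as illustration; the sample is written to a temporary file so the snippet can run on its own.

```python
import os
import tempfile


def find_tag_binary(path, tag=b"</DOC>", start=0):
    """Return the byte offset of tag in the file at path, or -1 if absent."""
    # 'rb' turns off text-mode processing, so 0x1A is read like any other byte.
    with open(path, "rb") as handle:
        data = handle.read()
    return data.find(tag, start)


# Hypothetical sample data: a SUB (0x1A) byte sits just before the closing tag.
sample = b"<DOC>\n<P>text</P>\n\x1a\n</DOC>\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample)

index = find_tag_binary(sample_path)
os.remove(sample_path)
print(index)  # 20
```

Note that in binary mode on Windows, line endings are not translated, so the data may contain "\r\n" where text mode would have returned "\n".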

nneonneo
0

Note: if the file uses a multibyte encoding, then .find() on the undecoded contents won't work even if there is no 0x1A in it, e.g.:

import codecs

with codecs.open('file.utf16', 'w', encoding='utf-16') as f:
    f.write(u"abcd")  # write a string using utf-16 encoding

# XXX incorrect code, don't use it
with open('file.utf16', 'r') as f:
    temp = f.read()
    i = temp.find('bc')
    print i #XXX -> -1 not found

with open('file.utf16', 'rb') as f:
    temp = f.read()
    i = temp.find('bc')
    print i #XXX -> -1 not found

# works
with codecs.open('file.utf16', encoding='utf-16') as f:
    temp = f.read()
    i = temp.find('bc')
    print i # -> 1 found
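The same comparison can be sketched with io.open, which accepts an encoding argument on both Python 2.6+ and Python 3. This is an illustrative variant under that assumption, not part of the code above; it writes to a temporary file instead of file.utf16, and the variable names raw_index and text_index are made up for the example.

```python
import io
import os
import tempfile

# Create a temporary file to hold the UTF-16 sample.
fd, path = tempfile.mkstemp(suffix=".utf16")
os.close(fd)

with io.open(path, "w", encoding="utf-16") as f:
    f.write(u"abcd")  # stored as a BOM followed by two bytes per character

# Searching the undecoded bytes fails: every character is separated by a zero byte.
with io.open(path, "rb") as f:
    raw_index = f.read().find(b"bc")

# Decoding first yields a text string, so the search succeeds.
with io.open(path, "r", encoding="utf-16") as f:
    text_index = f.read().find(u"bc")

os.remove(path)
print(raw_index)   # -1
print(text_index)  # 1
```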
jfs
-1

Check your indexOfOpenDoc value. I suspect it is larger than the position where the tag appears, in which case find() starts searching past the tag and returns -1.

Wei