1

I have a Python 2.7 program, run on a Unix server, that reads an ASCII file containing two types of information and processes it. I have put this process into a function that essentially does:

def read_info():
    f = open(file_name, 'rb')
    f_enumerator = enumerate(f, start=1)
    for i, line in f_enumerator:
        process_info(line, i)  # handle one line
    process_last_info()        # handle whatever is left after the last line

When this function is called on the file from my main program, it stops at a seemingly arbitrary point halfway through a line towards the end of the input file, whereas when the function is called from a simple wrapper on the same input file it reads the entire file correctly.

I have tried one of the solutions from "Python Does Not Read Entire Text File", where the file is read in as binary, but that did not fix the problem. The other solution there (reading the file in chunks) would be problematic, as I am trying to parse the file on a line-by-line basis, and reading in chunks of text would require a lot more parsing.

I would be willing to do that, except that the intermittent nature of the problem suggests to me that there might be some other solution?

Dan
  • 3
    Can you clarify a couple of things? Is this a text file, and if so, do you get the same issue without the `'b'` (you imply you do, just checking)? Is the file a Windows file, and which OS are you running on? The iteration does a read-ahead, which could explain the apparent randomness. – cdarke May 25 '15 at 17:06
  • 1
    Also, which version of Python do you use? – Cilyan May 25 '15 at 17:10
  • @Dan: since you found the problem cause, maybe you can post your solution, along with your considerations for the `with` usage as an answer - it could therefore be accepted as the right answer for the question. – jsbueno May 25 '15 at 17:35
  • Thanks for the responses, I realized it was a simple mistake on my part and I've edited accordingly. In answer to the questions: It is a text file. In a similar problem ( http://stackoverflow.com/questions/9905874/python-does-not-read-entire-text-file ) it was suggested that reading in as binary may fix the problem which is why I changed the mode to 'rb'. Details: Python 2.7, Ubuntu Server. – Dan May 25 '15 at 17:37

3 Answers

4

On further reflection I realized the cause: I had created the file earlier in the same program and had not closed the file handle, so this was most likely a buffering issue, with the last part of the written data not yet on disk when the file was read back. Closing the file earlier fixed the problem.

It was suggested that I use "with" syntax for writing to the file originally:

with open(file_name, 'w') as f:
    f.write(data)  # placeholder for the actual writes; f is closed when the block exits

This would indeed have prevented me from forgetting to close the file, and prevented this problem.
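The effect can be reproduced with a minimal sketch like the one below. The helper name `write_then_read` and the 1000-line payload are made up for illustration, and the exact amount of data that is missing depends on the interpreter's buffer size, so the unclosed case is only described as "typically fewer lines".

```python
import os
import tempfile


def write_then_read(close_first):
    """Write 1000 lines, then count the lines visible through a second handle."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    writer = open(path, "w")
    try:
        for n in range(1000):
            writer.write("line %d\n" % n)
        if close_first:
            writer.close()  # flushes the write buffer to disk
        with open(path) as reader:
            # A partially written last line still counts as one line.
            return sum(1 for _ in reader)
    finally:
        writer.close()  # harmless if already closed
        os.remove(path)


print(write_then_read(close_first=True))   # all 1000 lines are visible
print(write_then_read(close_first=False))  # typically fewer: the tail is still buffered
```

Calling `writer.flush()` before reading would also make the data visible, but closing the handle (or using `with`) is the simpler habit.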

jsbueno
Dan
  • Actually, re-opening the file for reading using `with` won't change your original problem. Opening it for writing using `with` would, though. – jsbueno May 25 '15 at 17:42
0
def read_info():
    with open(file_name, 'rb') as f:
        for i, a_line in enumerate(f, 1):  # a_line includes its trailing newline
            process_info(a_line, i)
    # you have processed the whole file here, so no need for `process_last_info`

Using `with` ensures that your file handle is closed (this matters most when writing to a file, but it is good practice in any case).
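As a small illustration of that guarantee, the sketch below raises an error inside a `with` block and then checks the handle. The temporary file and the simulated `ValueError` are assumptions made purely for the demonstration.

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    with open(path, "w") as f:
        f.write("partial\n")
        raise ValueError("simulated failure while writing")
except ValueError:
    pass

# The handle was closed (and therefore flushed) despite the exception.
closed_after_error = f.closed

with open(path) as g:
    contents = g.read()
os.remove(path)

print(closed_after_error)  # True
print(repr(contents))      # 'partial\n'
```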

Based on further information from the OP, I believe a generator would be an ideal solution to his problem:

def data_gen(f):
    header = None
    lines = []
    for line in f:
        if line.startswith(">"):  # header line
            if header is not None:  # skip the yield before the first header
                yield header, lines
            header = line  # set the header
            lines = []  # reinitialize lines
        else:
            lines.append(line)
    if header is not None:  # the last section (nothing to yield for an empty file)
        yield header, lines

def read_info(fname):
    with open(fname,"rb") as f:
        for header,lines in data_gen(f):
            process(header,lines)
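To show what such a generator produces, here is a self-contained variant run on made-up in-memory data. The name `iter_records`, the sample contents, and the choice to strip newlines and join the data lines into one string are assumptions for illustration, not part of the OP's actual format or code.

```python
import io


def iter_records(lines):
    """Yield (header, data) pairs; a header is any line starting with '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # The final record has no following header, so emit it here.
        yield header, "".join(chunks)


# Hypothetical sample: two records with a variable number of data lines.
sample = io.StringIO(u">first\nACGT\nTTGA\n>second 4\nGGCC\n")
records = list(iter_records(sample))
for header, data in records:
    print(header, data)
```

Because the last record is yielded after the loop ends, the caller needs no separate "process last info" step.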
Joran Beasley
  • Why is with open() as f: superior to f = open() and f.close()? In the full version of my code process_last_info is indeed necessary, based on the way the two types of information are stored sequentially. – Dan May 25 '15 at 17:18
  • `with` guarantees the file handle will be closed ... even if you encounter errors ... why wouldn't you have a line outside of read_info do your sequential things ... this sounds like an XY problem ... what are you actually trying to do ... do you want to go through the file 2 lines at a time? – Joran Beasley May 25 '15 at 17:28
  • -1 for an "I have no idea" syntactically incorrect answer, keeping the only explicit problem in the OP code unchanged and uncommented - namely opening a text file in the `'rb'` mode. – jsbueno May 25 '15 at 17:30
  • 1
    I see, in answering my own question perhaps I've demonstrated the exact utility of the "with" syntax. The file read in a title line followed by a variable number of data lines, with the data being processed when the next title line is reached or the file ends. – Dan May 25 '15 at 17:31
  • @jsbueno its syntactically correct ... and there is no error with opening a textfile as `rb` ... he never said it only contained ascii characters – Joran Beasley May 25 '15 at 17:32
  • @Dan I believe a generator would be ideal for this problem ... how do you know how many lines to read from the header(title?) and what does the title actually look like – Joran Beasley May 25 '15 at 17:33
  • 1
    It did indeed contain only ascii characters, but the other question/solution suggested reading in as binary may have addressed an early, mistaken reading of EOF. – Dan May 25 '15 at 17:33
  • The syntax error is still there :-) . And opening a file in "rb" for reading as text is _semantically_ incorrect. Although it is not Python's behavior, iterating over a file opened in binary mode should yield each byte as an element of the file (not each text line) – jsbueno May 25 '15 at 17:38
  • (btw, since I didn't produce an answer, rather just posted the fix in a visible way for other people landing here, I marked my post as "community wiki" - maybe you should take another action on a wiki answer than downvoting it for petty revenge). While you are at it, just add the missing ":" to fix your syntax so I can remove my downvote. – jsbueno May 25 '15 at 17:40
  • ok I fixed it .... you were right i overlooked that ... if you edit your answer I guess I will remove the downvote ... – Joran Beasley May 25 '15 at 17:42
  • @JoranBeasley Thank you for posting the generator. However doesn't it require a fixed number of lines of data? The number of data lines (aka continuous data string broken by newlines) is variable. – Dan May 25 '15 at 17:51
  • you would get the N_LINES from the header I assumed? thats why I asked what your headers looked like :P – Joran Beasley May 25 '15 at 17:53
  • @JoranBeasley Unfortunately no. Oh sorry I did not see that question. The header can, but is not required to, contain the length of the subsequent data string. Most often it's ">[Name]" and sometimes ">[Name and Data]". – Dan May 25 '15 at 17:53
  • ok so it starts with ">" ? and regular lines dont? is there a blank line or something right before? – Joran Beasley May 25 '15 at 17:56
  • @JoranBeasley I was treating a new data header as occurring when ">" is found at the beginning of the line, as it is not allowed in that position in the data. Yes, that's correct. No blank lines. – Dan May 25 '15 at 17:57
  • @JoranBeasley While this wasn't the answer to the question, this is a very nice solution to my original problem. Thank you! – Dan May 27 '15 at 14:38
0

As the O.P. found out, the problem was that the file had been created earlier in the same program, but had not been flushed or closed properly before the attempt to read it.

Joran Beasley
jsbueno