I count the number of rows (lines) in a file using Python as follows:

n = 0
for line in file('input.txt'):
   n += 1
print n

I run this script under Windows.

Then I count the number of rows in the same file using the Unix command:

wc -l input.txt

Counting with the Unix command gives a significantly larger number of rows.

So, my question is: why does Python not see all the rows in the file? Or is it a matter of definition?

Martijn Pieters
Roman
  • Perhaps your file contains EOF markers? Those are a real pain on Windows. – Martijn Pieters Apr 19 '16 at 19:18
  • See [How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?](http://stackoverflow.com/q/20695336) – Martijn Pieters Apr 19 '16 at 19:20
  • Strange... Never seen this before. – linusg Apr 19 '16 at 19:25
  • Can you verify which is correct, Python or the Unix command? Take care that you did not use a capital L, i.e. `wc -L`, which gives the length of the longest line rather than the number of lines (that could explain a significantly larger result). – sytech Apr 19 '16 at 19:35
  • `wc` seems to be correct (judging by the file size). I used a lowercase l, so it really is the number of rows. – Roman Apr 19 '16 at 19:57
  • Then open the file in binary mode (and count newline characters as you read blocks) or use `import io; for line in io.open('input.txt'):` which I strongly suspect is not going to fall for EOF. – Martijn Pieters Apr 19 '16 at 20:51

1 Answer

You most likely have a file with one or more DOS EOF (CTRL-Z) characters in it, ASCII codepoint 0x1A. When Windows opens a file in text mode, it'll still honour the old DOS semantics and end a file whenever it reads that character. See Line reading chokes on 0x1A.
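
To confirm that diagnosis before changing anything else, you could scan the file for 0x1A bytes. The following is a minimal sketch, not part of the original answer: the helper name `find_eof_markers` and the chunk size are assumptions for illustration. It opens the file in binary mode so the read does not stop at the marker.

```python
def find_eof_markers(filename, buf_size=2 ** 15):
    """Return the byte offsets of every 0x1A (Ctrl-Z) byte in the file.

    Hypothetical helper for illustration; not a standard library function.
    """
    offsets = []
    position = 0  # offset of the current chunk within the file
    with open(filename, 'rb') as f:  # binary mode: 0x1A is just another byte
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            start = buf.find(b'\x1a')
            while start != -1:
                offsets.append(position + start)
                start = buf.find(b'\x1a', start + 1)
            position += len(buf)
    return offsets
```

An empty list would suggest that EOF markers are not the cause and that the discrepancy comes from something else.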

Only by opening a file in binary mode can you bypass this behaviour. To do so and still count lines, you have two options:

  • Read in chunks, then count the number of line separators in each chunk:

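    import os
    from functools import partial

    def bufcount(filename, linesep=os.linesep.encode('ascii'), buf_size=2 ** 15):
        # linesep must be a byte string, because the file is read in binary mode
        lines = 0
        with open(filename, 'rb') as f:
            last = b''
            # read buf_size bytes at a time until read() returns an empty string
            for buf in iter(partial(f.read, buf_size), b''):
                lines += buf.count(linesep)
                if last and last + buf[:1] == linesep:
                    # count line separators straddling a chunk boundary
                    lines += 1
                if len(linesep) > 1:
                    # remember the final byte in case a separator is split
                    last = buf[-1:]
        return lines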
    

    Note that on Windows os.linesep is \r\n; adjust linesep as needed for your file, because in binary mode line separators are not translated to \n.

  • Use io.open(); the file objects in the io module always open the file in binary mode and then do the newline translation themselves:

    import io
    
    with io.open(filename) as f:
        lines = sum(1 for line in f)
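
To compare either result against `wc -l`, it may help to remember that `wc -l` counts newline characters, so a final line without a trailing newline is not included, while iterating over lines does count it. A minimal sketch of a binary-mode count that mirrors `wc -l` follows; the function name `count_newlines` is an assumption for illustration, not something from the original answer.

```python
from functools import partial


def count_newlines(filename, buf_size=2 ** 15):
    """Count newline bytes in binary mode, mirroring what `wc -l` reports.

    Hypothetical helper for illustration. Because the separator is a
    single byte, it can never straddle a chunk boundary, and \r\n line
    endings are still counted once each through their \n byte.
    """
    count = 0
    with open(filename, 'rb') as f:
        # read buf_size bytes at a time until read() returns b''
        for buf in iter(partial(f.read, buf_size), b''):
            count += buf.count(b'\n')
    return count
```

If this number matches `wc -l` while the text-mode loop reports fewer lines, the file is being cut short on read rather than counted differently.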
    
Martijn Pieters