I need to process a large file (about 1 GB) in Python, line by line. This is my approach:

with open('file.txt', 'r') as f:
    i = 0
    for fline in f:
        process(fline)
        i = i + 1
print(i)  # parenthesized form works on both 2.x and 3.x

The value of i (the number of iterations, which should equal the number of lines in the file) is 19,991,889.

But EmEditor reports that the file has 63,941,070 lines.

Why don't the line counts match? What am I doing wrong?

Thanks.

Levon
Keaire
  • What's the `process` function? – Jossie Calderon Jul 24 '16 at 20:02
  • Does your file contain mixed line endings (`\n`, `\r`, `\n\r`)? – Daniel Jul 24 '16 at 20:02
  • Can you use a utility like `wc` to get another line count? By the way, you can shorten your for-loop using `enumerate`, which gives you the index value automatically: `for i, fline in enumerate(f):` - see also http://stackoverflow.com/questions/522563/accessing-the-index-in-python-for-loops. Finally, `'r'` is optional when reading files – Levon Jul 24 '16 at 20:06
  • How does `EmEditor` count/show its line numbers? – Moses Koledoye Jul 24 '16 at 20:08
  • The process function is a simple condition that checks whether a word from another file matches a line of the big file. But I tried removing this condition and leaving only i = i + 1, and the result is the same. I think my file only has `\n` (because it has one word per line). – Keaire Jul 24 '16 at 20:12
  • This is a screenshot of the line count that EmEditor reports: http://prntscr.com/bwzhob. I tried using enumerate, and the result is the same: http://prntscr.com/bwzjkj – Keaire Jul 24 '16 at 20:18
  • Using `enumerate()` will give you the same count, it's just a more pythonic way to code this. I would try a different utility to get another line count so that you can confirm what the correct count is. – Levon Jul 24 '16 at 20:20
  • I tried now with gVim, same result: http://prntscr.com/bwzp45 The file contains special characters, but I don't think they make up 2/3 of the file. Can this be a problem? – Keaire Jul 24 '16 at 20:32
  • Could you show a screenshot of a small part of your file? – Moses Koledoye Jul 24 '16 at 20:38
  • http://prntscr.com/bwzuxh It's the human-wordlist that you can find on CrackStation – Keaire Jul 24 '16 at 20:45
  • You should check what @Daniel said. Given you're on Windows, if Python uses `"\r\n"` as the newline separator but many lines in your file only use `"\n"`, this will give lines like `"foo\nbar\nbaz\r\n"`, which yields fewer lines. If all lines use `"\n"`, Python will just see one big line... – λuser Jul 24 '16 at 21:50

2 Answers

I can think of two possibilities.

  1. You are running 2.x on Windows, the file contains about 3x as many '\r' characters as recognized '\r\n' or '\n' line endings, and EmEditor recognizes '\r' as a line ending, as does Python 3.x. Or something similar is happening with 2.7 on another OS.

Explanation: you open the file in text mode. 3.x uses OS-independent universal newlines, so '\r' and '\r\n' are both converted to '\n'. 2.x uses OS-dependent reading: on Windows, '\r\n' is translated to '\n', but a lone '\r' is not treated as a line ending.

Example:

with open('tem.dat', 'wb') as f:
    f.write(b'a\rb\r\nc\n\rd\n')
with open('tem.dat', 'r') as f:
    for i, t in enumerate(f):
        print(i, t, repr(t[-1]))

3.x prints

0 a
 '\n'
1 b
 '\n'
2 c
 '\n'
3 
 '\n'
4 d
 '\n'

2.x prints

(0, 'a\rb\n', "'\\n'")
(1, 'c\n', "'\\n'")
(2, '\rd\n', "'\\n'")

Diagnosis: add to your code "if '\r' in fline: print(fline)" before processing.

  2. There is something in the file that Python sees as end-of-file and EmEditor does not. Diagnosis: add 'length = 0' before the loop and 'length += len(fline)' inside it, then check after the loop whether length is at least approximately the file size.
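
Both diagnoses can be combined into a single binary-mode pass, roughly as sketched below. `scan_line_endings` is a hypothetical helper name, not part of the original suggestion. It counts each terminator style, totals the bytes, and records the offset of the first 0x1A (Ctrl-Z) byte. Treating 0x1A as a possible end-of-file marker for Windows text-mode reads in 2.x is an assumption to check against your own file, not a confirmed cause.

```python
import os
import tempfile


def scan_line_endings(path, chunk_size=1 << 20):
    """Read raw bytes and report terminator counts, size, and first 0x1A offset."""
    crlf = cr = lf = total = 0
    first_ctrl_z = None          # offset of the first b'\x1a', if any
    prev_ended_with_cr = False   # True if the previous chunk ended in '\r'
    with open(path, 'rb') as f:  # binary mode: no newline translation
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if first_ctrl_z is None:
                pos = chunk.find(b'\x1a')
                if pos != -1:
                    first_ctrl_z = total + pos
            crlf += chunk.count(b'\r\n')
            if prev_ended_with_cr and chunk.startswith(b'\n'):
                crlf += 1        # a '\r\n' pair split across two chunks
            cr += chunk.count(b'\r')
            lf += chunk.count(b'\n')
            prev_ended_with_cr = chunk.endswith(b'\r')
            total += len(chunk)
    return {
        'crlf': crlf,
        'lone_cr': cr - crlf,    # '\r' not followed by '\n'
        'lone_lf': lf - crlf,    # '\n' not preceded by '\r'
        'total_bytes': total,
        'first_ctrl_z': first_ctrl_z,
    }


# Demo on the same bytes as the example above, written to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'a\rb\r\nc\n\rd\n')
result = scan_line_endings(tmp.name)
os.remove(tmp.name)
print(result)
```

For the sample bytes this reports one '\r\n', two lone '\r' and two lone '\n', which add up to the five lines 3.x prints. On the real file, a large 'lone_cr' count points to cause 1. If 'first_ctrl_z' is set and the 'length' total from the text-mode loop is close to that offset, that is consistent with cause 2.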
Terry Jan Reedy

The numbers don't match because the encoding that the open function uses isn't correct for this file. Try the "ISO-8859-1" encoding instead.
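
One way to try this is sketched below. It uses `io.open`, which accepts `encoding` and `newline` arguments on both 2.x and 3.x (the built-in `open` in 2.x has no `encoding` parameter). `count_lines` is a hypothetical helper name, and whether ISO-8859-1 is actually the right encoding for your file is an assumption to test.

```python
import io


def count_lines(path, encoding='ISO-8859-1', newline=None):
    """Count lines seen by text-mode iteration with an explicit encoding.

    newline=None enables universal newlines ('\r', '\n' and '\r\n' all end
    a line); newline='\n' splits on '\n' only.
    """
    with io.open(path, 'r', encoding=encoding, newline=newline) as f:
        return sum(1 for _ in f)
```

Calling `count_lines('file.txt')` and `count_lines('file.txt', newline='\n')` and comparing both results with the editor's count shows whether the gap comes from how line endings are interpreted rather than from the encoding.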

AncientSwordRage