4

For some large file, I count the lines in Python:

# fa is an already-open file object
lines_a = len(fa.readlines())
print(lines_a)

And for Bash (on Mac):

wc -l

The results are different!

What could be the reason?

Daniel R. Livingston
Andy Yuan

3 Answers

8

wc -l prints the number of newlines in input. In other words, its definition of "line" in "line count" requires the line to end with a newline, and is actually defined by POSIX.

This definition of a line can yield surprising behavior if the last line in your file does not end with a newline. Although such a line is displayed just fine in text editors and pagers, wc will not count it as a line. For example:

$ printf 'foo\nbar\n' | wc -l
2
$ printf 'foo\nbar' | wc -l
1

Python's readlines() method, on the other hand, is designed to provide the data in the file so that it can be perfectly reconstructed. For that reason, it returns each line with its final newline, and the last non-empty line as-is (with or without a final newline). For the two inputs above, it returns the lists ["foo\n", "bar\n"] and ["foo\n", "bar"] respectively, both of length two:

$ printf 'foo\nbar' | python -c 'import sys; print(len(sys.stdin.readlines()))'
2
$ printf 'foo\nbar\n' | python -c 'import sys; print(len(sys.stdin.readlines()))'
2
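The same comparison can be sketched in pure Python. The helper names wc_l and readlines_count are made up for this illustration; wc_l only mimics the newline-counting rule described above and is not how wc is actually implemented.

```python
import io


def wc_l(data: bytes) -> int:
    # Mimic the rule described above: count newline bytes only.
    return data.count(b"\n")


def readlines_count(data: bytes) -> int:
    # readlines() also returns a trailing line that lacks a final newline.
    return len(io.BytesIO(data).readlines())


for sample in (b"foo\nbar\n", b"foo\nbar"):
    print(sample, wc_l(sample), readlines_count(sample))
```

The two counts agree for the first sample and differ by one for the second, which has no newline after "bar".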
user4815162342
  • You gave me a reasonable explanation, but I have a big file with millions of lines, and "wc -f" and "len(readlines())" give different results. However, I checked the file with a script, and each line contains just one '\n', at the end of the line, so I guess there must be another reason. Do you have any other ideas? – Andy Yuan Jun 13 '17 at 12:49
  • Or maybe I should ask: is there a way in Python to treat a line such as "aaa\nbbb" as just one line? – Andy Yuan Jun 13 '17 at 12:59
  • @AndyYuan Sorry, I don't know what `wc -f` does. Also, if the file is so big, maybe it is being written to while `wc` is operating, which could explain the difference. – user4815162342 Jun 13 '17 at 14:13
  • @user4815162342 Sorry for my mistake, it should be 'wc -l'. My question is whether Python has a function that treats a line such as "aaa\nbbb\n" as just one line. – Andy Yuan Jun 14 '17 at 05:44
  • @AndyYuan "aaa\nbbb\n" is two lines. If you want to "take it as one line", how do you know when to stop reading it? Python file objects have a `read()` method that returns the whole file contents as a string; perhaps you can use that, and then split the resulting string as desired. – user4815162342 Jun 14 '17 at 05:57
3

I ran into a similar problem while working on a machine translation task. The line count is probably wrong because you did not open the file in 'b' (binary) mode. Try this:

with open('some file', 'rb') as f:
    print(len(f.readlines()))

In binary mode, lines are split only on b'\n', so you should get the same number as wc -l (as long as the file ends with a newline).
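For a file with millions of lines, readlines() holds every line in memory at once. A rough alternative, assuming the goal is simply to reproduce the number of newline bytes, is to count b"\n" chunk by chunk in binary mode. The function name count_newlines, the chunk size, and the sample data are all choices made for this sketch.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1 << 20):
    # Read the file in binary mode, one chunk at a time, and count b"\n" bytes.
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


# Demo on a small temporary file whose last line has no trailing newline.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"foo\nbar\nbaz")
    demo_path = tmp.name

try:
    newline_count = count_newlines(demo_path)
    with open(demo_path, "rb") as f:
        readlines_count = len(f.readlines())
finally:
    os.remove(demo_path)

print(newline_count, readlines_count)
```

On this sample the newline count is one lower than the readlines() count, because the final "baz" is not terminated by a newline.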

Zhen Yang
1

This could also happen if you have \r in your text file.

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

(From the Python io.TextIOWrapper documentation.)
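A small sketch of this effect, using made-up sample data with one stray '\r' inside the first line. It compares the default text mode (newline=None), text mode with newline="\n", and binary mode on the same temporary file.

```python
import os
import tempfile

# Hypothetical sample: a lone '\r' in the middle, and two '\n' terminators.
data = b"aaa\rbbb\nccc\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    with open(path) as f:  # newline=None: universal newlines mode
        universal = len(f.readlines())
    with open(path, newline="\n") as f:  # only '\n' terminates a line
        only_lf = len(f.readlines())
    with open(path, "rb") as f:  # binary mode: split on b'\n' only
        binary = len(f.readlines())
finally:
    os.remove(path)

print(universal, only_lf, binary)
```

In universal newlines mode the lone '\r' counts as a line ending, so that count is one higher than the other two, which match the number of '\n' bytes in the data.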

Gowtham Ramesh