
In Python 3.6, reading a file takes longer if it contains line breaks. Given two files with the same text, one with line breaks and one without, the file with line breaks takes roughly 1.5-2x as long to read in my tests. A specific example follows.

Step #1: Create the files

sizeMB = 128
sizeKB = 1024 * sizeMB

with open(r'C:\temp\bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73)  # There are roughly 73 phrases in one KB

with open(r'C:\temp\bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):  
        f.write('Hello World!\n'*73)

Step #2: Read the file with one single line and time performance

IPython

%%timeit
with open(r'C:\temp\bigfile_one_line.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 368 ms per loop

Step #3: Read the file with many lines and time performance

IPython

%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 589 ms per loop

This is just one example. I have tested many different variations, and the behavior is the same in each (a sketch of a timing harness follows the list):

  1. Different file sizes from 1MB to 2GB
  2. Using file.readlines() instead of file.read()
  3. Using a space instead of tab ('\t') in the single line file (i.e. 'Hello World! ')
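
For reference, here is a minimal sketch (mine, not from the original post) of how such variations can be timed outside of IPython, using the standard-library timeit and the file paths from Step #1:

import timeit

def read_all(path, use_readlines=False):
    with open(path, 'r') as f:
        return f.readlines() if use_readlines else f.read()

# Hypothetical harness: times read() vs. readlines() for both test files.
for path in (r'C:\temp\bigfile_one_line.txt', r'C:\temp\bigfile_newlines.txt'):
    for use_readlines in (False, True):
        seconds = timeit.timeit(lambda: read_all(path, use_readlines), number=3)
        print(path, 'readlines' if use_readlines else 'read', seconds / 3)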

My conclusion is that files with newline characters ('\n') take longer to read than files without them. However, I would expect all characters to be treated the same. This can have important performance consequences when reading many files. Does anyone know why this happens?

I am using Python 3.6.1, Anaconda 4.3.24, and Windows 10.

pwaivers
  • Interesting find. One thought that immediately comes to mind is that this might not be Python's fault -- could be related to your OS or your filesystem. Would be worth testing on other systems. – Hayden Schiff Sep 25 '17 at 23:26
  • 2
    Just a thought, on Windows, opening in text-mode converts `'\n'` characters to `'\r\n'` when you *write*, and the reverse when you *read*. This might explain it. Try opening in binary mode. – juanpa.arrivillaga Sep 25 '17 at 23:34
  • 1
    I'd expect [universal newlines](https://docs.python.org/3/glossary.html#term-universal-newlines) handling to also slow things down. – user2357112 Sep 25 '17 at 23:39
  • @user2357112 yep! – juanpa.arrivillaga Sep 25 '17 at 23:41
  • Seems pretty obvious if you think about it, so it's hard to understand all the up votes... – martineau Sep 26 '17 at 00:17
  • @martineau: It's not really obvious. Even if you're aware of the fact that line breaks have to be translated, and even if you're aware of the existence of Python's universal newlines feature, it's still not obvious that this feature would actually have to perform more work (especially so *much* more work) when it actually finds line endings that need to be translated. Line ending translation usually doesn't have this kind of performance impact. – user2357112 Sep 26 '17 at 00:25
  • @user2357112: Ignoring the universal newline feature of Python—which only exacerbates the overhead—it's should be evident that splitting the raw data in the file up into pieces delimited by one or more arbitrary special values is going to require more processing than not doing it. – martineau Sep 26 '17 at 14:34
  • @martineau: The file has no concept of delimiters or special values, right? My code does not ask for the text file to be separated onto separate lines, therefore it would treat special characters like '\r' and '\n' as any other characters ('0A' and '0D' in ASCII). I think the _only_ overhead comes from the universal newlines feature. – pwaivers Sep 26 '17 at 14:43
  • pwaivers: Again, it seems apparent that when reading a file in default "text" mode, where every byte of data must be examined to see if it's one of the special characters that needs special handling (such as being translated/replaced), it will take longer when there are many such values in the file than when there aren't. – martineau Sep 26 '17 at 15:03
  • @martineau: Using the OS's line break translation (as happens on Python 2, for example), there's essentially no overhead. The speed is limited by other factors, like disk read rates. If universal newlines were an OS-level feature instead of happening in a separate post-processing pass, I doubt you'd see a measurable overhead for universal newlines, either. – user2357112 Sep 26 '17 at 16:27

3 Answers


When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). What universal newlines means is that regardless of what kind of newline characters are used in the file, you'll see only \n in Python. So a file containing foo\nbar would appear the same as a file containing foo\r\nbar or foo\rbar (since \n, \r\n and \r are all line ending conventions used on some operating systems at some time).

The logic that provides that support is probably what causes your performance differences. Even if the \n characters in the file are not being transformed, the code needs to examine them more carefully than it does non-newline characters.

I suspect the performance difference you see will disappear if you open your files in binary mode, where no such newline support is provided. You can also pass a newline parameter to open in Python 3, which can have various meanings depending on exactly what value you give. I have no idea what impact any specific value would have on performance, but it might be worth testing if the performance difference you're seeing actually matters to your program. I'd try passing newline="" and newline="\n" (or whatever your platform's conventional line ending is).
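
As an illustrative sketch (the variants and timing code are mine, not from the question), here is one way to compare those alternatives side by side:

import time

path = r'C:\temp\bigfile_newlines.txt'  # path from the question

def timed_read(**kwargs):
    start = time.perf_counter()
    with open(path, **kwargs) as f:
        f.read()
    return time.perf_counter() - start

print('text mode (universal newlines):', timed_read(mode='r'))
print("newline='' (endings returned untranslated):", timed_read(mode='r', newline=''))
print("newline='\\n' (only \\n recognized):", timed_read(mode='r', newline='\n'))
print('binary mode:', timed_read(mode='rb'))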

Blckknght
  • @pwaivers Don't leave us hanging! What happens to your benchmark when you open the files in binary mode? – jpaugh Oct 25 '17 at 19:15

However, I would expect all characters to be treated the same.

Well, they're not. Line breaks are special.

Line breaks aren't always represented as \n. The reasons are a long story dating back to the early days of physical teleprinters, which I won't go into here; the upshot is that Windows uses \r\n, Unix uses \n, and classic Mac OS used \r.

If you open a file in text mode, the line breaks used by the file will be translated to \n when you read them, and \n will be translated to your OS's line break convention when you write. In most programming languages, this is handled on the fly by OS-level code and is pretty cheap, but Python does things differently.

Python has a feature called universal newlines, where it tries to handle all line break conventions, no matter what OS you're on. Even if a file contains a mix of \r, \n, and \r\n line breaks, Python will recognize all of them and translate them to \n. Universal newlines is on by default in Python 3 unless you configure a specific line ending convention with the newline argument to open.
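
A quick illustration of that behavior (the filename here is just for demonstration):

# Write mixed line endings in binary, then read back in text mode.
with open('mixed_endings.txt', 'wb') as f:
    f.write(b'one\rtwo\r\nthree\nfour')

with open('mixed_endings.txt', 'r') as f:  # universal newlines (the default)
    print(repr(f.read()))  # 'one\ntwo\nthree\nfour'

with open('mixed_endings.txt', 'r', newline='') as f:  # translation disabled
    print(repr(f.read()))  # 'one\rtwo\r\nthree\nfour'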

In universal newlines mode, the file implementation has to read the file in binary mode, check the contents for \r and \n characters, and construct a new string object with line endings translated if it finds \r or \r\n line endings. If it only finds \n endings, or if it finds no line endings at all, it doesn't need to perform the translation pass or construct a new string object.

Constructing a new string and translating line endings takes time. When reading the file with tabs, Python doesn't have to perform the translation.
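
As an aside (my illustration, not part of the original answer), the text layer records which line-ending styles it actually encountered, which makes the translation visible:

import io

# Feed mixed endings through the text layer and inspect what it saw.
buf = io.BytesIO(b'one\rtwo\r\nthree\n')
text = io.TextIOWrapper(buf, encoding='ascii')
print(repr(text.read()))  # 'one\ntwo\nthree\n' -- everything translated to \n
print(text.newlines)      # the styles seen: ('\r', '\n', '\r\n')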

user2357112

On Windows, opening in text-mode converts '\n' characters to '\r\n' when you write, and the reverse when you read.

So, I did some experimentation. I am on macOS right now, so my "native" line ending is '\n', and I cooked up a similar test to yours, except using non-native, Windows line endings:

sizeMB = 128
sizeKB = 1024 * sizeMB

with open(r'bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!!\t'*73)  # note the extra '!': 14 chars, same length as 'Hello World!\r\n', so both files are the same size; roughly 73 phrases per KB

with open(r'bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\r\n'*73)

And the results:

In [4]: %%timeit
   ...: with open('bigfile_one_line.txt', 'r') as f:
   ...:     text = f.read()
   ...:
1 loop, best of 3: 141 ms per loop

In [5]: %%timeit
   ...: with open('bigfile_newlines.txt', 'r') as f:
   ...:     text = f.read()
   ...:
1 loop, best of 3: 543 ms per loop

In [6]: %%timeit
   ...: with open('bigfile_one_line.txt', 'rb') as f:
   ...:     text = f.read()
   ...:
10 loops, best of 3: 76.1 ms per loop

In [7]: %%timeit
   ...: with open('bigfile_newlines.txt', 'rb') as f:
   ...:     text = f.read()
   ...:
10 loops, best of 3: 77.4 ms per loop

Very similar to yours; and note that the performance difference disappears when I open in binary mode. OK, what if instead, I use *nix line-endings?

with open(r'bigfile_one_line_nix.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73)  # There are roughly 73 phrases in one KB

with open(r'bigfile_newlines_nix.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\n'*73)

And the results using these new files:

In [11]: %%timeit
    ...: with open('bigfile_one_line_nix.txt', 'r') as f:
    ...:     text = f.read()
    ...:
10 loops, best of 3: 144 ms per loop

In [12]: %%timeit
    ...: with open('bigfile_newlines_nix.txt', 'r') as f:
    ...:     text = f.read()
    ...:
10 loops, best of 3: 138 ms per loop

Aha! The performance difference disappears! So yes, I think using non-native line-endings impacts performance, which makes sense given the behavior of text-mode.

juanpa.arrivillaga
  • 1
    I just completed the same test in Linux and got very similar results: the file that uses `\r\n` endings took approx twice as long to read. Python 3 performs new line translation, so the `\r\n` characters are automatically converted to `\n`, which obviously takes time. – mhawke Sep 25 '17 at 23:52
  • As an alternative to switching to binary mode (which toggles a ton of different behaviors, not just line ending modes), you might try passing `newline='\n'` (both for write and for read) which explicitly disables universal newline mode in favor of an OS-agnostic use of `\n` only. – ShadowRanger Sep 26 '17 at 00:26
  • 1
    @ShadowRanger: Or since this is Windows, `newline='\r\n'` would probably be more appropriate. – user2357112 Sep 26 '17 at 00:26