In Python 3.6, it takes longer to read a file if it contains line breaks. If I have two files, one with line breaks and one without (but otherwise containing the same text), the file with line breaks takes roughly 1.5 to 2 times as long to read. I have provided a specific example below.
Step #1: Create the files
sizeMB = 128
sizeKB = 1024 * sizeMB
with open(r'C:\temp\bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t' * 73)  # 73 13-byte phrases is roughly 1 KB
with open(r'C:\temp\bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\n' * 73)
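Note that both files are written in text mode, and on Windows text mode translates each '\n' to '\r\n' on disk, so the two files may not end up exactly the same size. A quick size check before timing:
import os

# On Windows, text-mode writes expand '\n' to '\r\n', so the file with
# newlines ends up larger on disk than the single-line file.
for path in (r'C:\temp\bigfile_one_line.txt',
             r'C:\temp\bigfile_newlines.txt'):
    print(path, os.path.getsize(path), 'bytes')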
Step #2: Read the single-line file and time the read
IPython
%%timeit
with open(r'C:\temp\bigfile_one_line.txt', 'r') as f:
text = f.read()
Output
1 loop, best of 3: 368 ms per loop
Step #3: Read the file with many lines and time the read
IPython
%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'r') as f:
text = f.read()
Output
1 loop, best of 3: 589 ms per loop
This is just one example. I have tested many different situations, and they all show the same behavior (a standalone version of the timing code follows this list):
- Different file sizes, from 1 MB to 2 GB
- Using file.readlines() instead of file.read()
- Using a space instead of a tab ('\t') in the single-line file (i.e. 'Hello World! ')
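For completeness, here is a minimal standalone version of the timing comparison using the timeit module, so it can be run outside IPython; it assumes the files created in Step #1 exist:
import timeit

def read_file(path):
    with open(path, 'r') as f:
        return f.read()

for path in (r'C:\temp\bigfile_one_line.txt',
             r'C:\temp\bigfile_newlines.txt'):
    # Best of 3 single full reads, similar to what %%timeit reports.
    best = min(timeit.repeat(lambda: read_file(path), repeat=3, number=1))
    print('%s: best of 3: %.0f ms' % (path, best * 1000))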
My conclusion is that files with newline characters ('\n') take longer to read than files without them, even though I would expect all characters to be treated the same. This can have significant performance consequences when reading many files. Does anyone know why this happens?
I am using Python 3.6.1, Anaconda 4.3.24, and Windows 10.
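One check that might help narrow this down (this is my assumption, not something I have verified: the overhead could come from text-mode processing such as decoding and universal newline translation rather than from the raw disk read) would be to repeat the timing in binary mode, which skips both:
IPython
%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'rb') as f:
    data = f.read()  # bytes object: no decoding, no newline translation
If binary-mode reads of the two files take the same time, the difference is in the text layer rather than in the I/O itself.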