I'm using Python 2.7 to compare two text files line by line, ignoring:
1. different line endings ('\r\n' vs '\n')
2. the number of empty lines at the end of the files
Below is the code I have. It handles point 2 but not point 1. The files I'm comparing can be big, so I read them line by line instead of loading them into memory. Please don't suggest zip or similar approaches.
def compare_files_by_line(fpath1, fpath2):
    # notice the opening mode 'r'
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        file1_end = False
        file2_end = False
        found_diff = False
        while not file1_end and not file2_end and not found_diff:
            try:
                # reasons for stripping explained below
                f1_line = next(file1).rstrip('\n')
            except StopIteration:
                f1_line = None
                file1_end = True
            try:
                f2_line = next(file2).rstrip('\n')
            except StopIteration:
                f2_line = None
                file2_end = True
            if f1_line != f2_line:
                if file1_end or file2_end:
                    if not (f1_line == '' or f2_line == ''):
                        found_diff = True
                        break
                else:
                    found_diff = True
                    break
    return not found_diff
You can see this code fail on point 1 by feeding it two files, one whose line ends with a UNIX newline:
abc\n
and the other whose line ends with a Windows newline:
abc\r\n
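To illustrate what seems to go wrong, here is a minimal sketch of the stripping step in isolation. It assumes the two lines reach the comparison untranslated, exactly as written above; the variable names are only for illustration.

```python
unix_line = 'abc\n'
windows_line = 'abc\r\n'

# rstrip('\n') removes only trailing '\n' characters, so the '\r' survives
stripped_unix = unix_line.rstrip('\n')
stripped_windows = windows_line.rstrip('\n')

print(repr(stripped_unix))                # 'abc'
print(repr(stripped_windows))             # 'abc\r'
print(stripped_unix == stripped_windows)  # False
```

So if the '\r' is not translated away when the file is read, the two lines compare as different.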
I strip the newline character before each comparison to account for point 2. That way, when two files contain the same series of lines, the code recognizes them as "not different" even if one file ends with an empty line and the other does not.
According to this answer, opening the files in 'r' mode (instead of 'rb') should take care of the OS-specific line endings and read them all as '\n'. This is not happening.
How can I make this code treat '\r\n' line endings the same as '\n' endings?
I'm using Python 2.7.12 with the Anaconda distribution 4.2.0.
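For reference, here is a minimal sketch of the behaviour I'm after. It is one possible direction, not a confirmed fix: it assumes `io.open` (available in Python 2.7) with its default `newline=None`, which enables universal newlines and translates '\r\n' to '\n' on read. The function name `compare_files_ignoring_newlines` is hypothetical, and the files are assumed to be decodable with the default encoding.

```python
import io


def compare_files_ignoring_newlines(fpath1, fpath2):
    """Return True if both files match, ignoring newline style and
    the number of empty lines at the end of the files."""
    # newline=None (the io.open default) turns '\r\n' and '\r' into '\n' on read
    with io.open(fpath1, 'r', newline=None) as file1, \
            io.open(fpath2, 'r', newline=None) as file2:
        while True:
            # read one line from each file; None marks the end of a file
            line1 = next(file1, None)
            line2 = next(file2, None)
            if line1 is None and line2 is None:
                return True
            # treat a missing line as empty so trailing empty lines are ignored
            text1 = '' if line1 is None else line1.rstrip('\n')
            text2 = '' if line2 is None else line2.rstrip('\n')
            if text1 != text2:
                return False
```

The loop still reads one line at a time from each file, so memory use stays constant for large files. Once one file is exhausted, the other only matches if all of its remaining lines are empty.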