The posted code doesn't work properly because Counter is counting the individual characters in the file; it doesn't look for character pairs like \r\n and \n\r.
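As a minimal sketch of the problem (the sample string is made up for illustration, and Counter needs Python 2.7 or later), passing a string to Counter tallies one character at a time, so the two-character markers never show up as keys:

```python
from collections import Counter

# Hypothetical sample: one '\r\n' pair and one '\n\r' pair
sample = 'one\r\ntwo\n\rthree'

# Counter iterates over the string character by character,
# so each pair is split into its component characters.
counts = Counter(sample)

print('CR: %d, LF: %d' % (counts['\r'], counts['\n']))
print('CRLF: %d, LFCR: %d' % (counts['\r\n'], counts['\n\r']))
```

Each pair contributes one \r and one \n to the tallies, while the counts for the pairs themselves stay at zero.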
Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single-char EOL markers, because the regex engine tries the alternatives from left to right and takes the first one that matches.
For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.
#!/usr/bin/env python
''' Find and count various line ending character combinations
From http://stackoverflow.com/q/29695861/4014959
Written by PM 2Ring 2015.04.17
'''
import random
import re
from itertools import groupby
random.seed(42)
#Make a random text string containing various EOL combinations
tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
datasize = 300
data = ''.join([random.choice(tokens) for _ in range(datasize)])
print repr(data), '\n'
#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')
eols = pat.findall(data)
print eols, '\n'
grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)
Output:
'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r'
['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r']
[(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]
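To see why the order of the alternatives in the pattern matters, here is a small sketch comparing the pattern used above with a reordered one. The sample string and the names pairs_first and singles_first are made up for this illustration.

```python
import re

# Hypothetical sample containing each of the 4 EOL markers once
sample = 'one\r\ntwo\n\rthree\rfour\n'

# Pairs listed first: each two-character marker is matched as a unit
pairs_first = re.compile(r'\r\n|\n\r|\r|\n')

# Single characters listed first: a lone '\r' or '\n' always matches
# before the pair alternatives are tried, so the pairs are never found
singles_first = re.compile(r'\r|\n|\r\n|\n\r')

print(repr(pairs_first.findall(sample)))
print(repr(singles_first.findall(sample)))
```

With the pairs first, findall returns the 4 markers as written in the sample; with the single characters first, it returns 6 one-character matches and no pairs.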
Here's a version that reads the data from a named file, following the pattern of the code in the question.
import re
from itertools import groupby
import sys
if not sys.argv[1:]:
    exit('usage: %s <filename>' % sys.argv[0])
with open(sys.argv[1], 'rb') as f:
    data = f.read()
print repr(data), '\n'
#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')
eols = pat.findall(data)
print eols, '\n'
grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)
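If you're on Python 3 (or 2.7), a shorter variant is possible. This is only a sketch, not a drop-in replacement for the code above: the names EOL_PAT, count_eols and count_eols_in_file are my own, and it assumes the file fits in memory. In Python 3 a file opened in 'rb' mode yields bytes, so the pattern has to be a bytes pattern too, and Counter can replace the sort-and-groupby step once the regex has found the markers.

```python
import re
from collections import Counter

# Bytes pattern (hypothetical name), pairs before single characters.
# A bytes pattern is needed because 'rb' mode returns bytes in Python 3.
EOL_PAT = re.compile(br'\r\n|\n\r|\r|\n')

def count_eols(data):
    '''Return a list of (marker, count) tuples, most frequent first.'''
    return Counter(EOL_PAT.findall(data)).most_common()

def count_eols_in_file(filename):
    '''Read the whole file as bytes and count its EOL markers.'''
    with open(filename, 'rb') as f:
        return count_eols(f.read())

# Small in-memory check using made-up data
print(count_eols(b'a\r\nb\r\nc\n\rd\r'))
```

Counter works here because it is counting the strings returned by findall, not the raw characters of the file.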