Get newline stats for a text file in Python

Question

I had a nasty CRLF/LF conflict in a git file that was probably committed from a Windows machine. Is there a cross-platform way (preferably in Python) to detect which type of newline is dominant in a file?

I've got this code (based on an idea from https://stackoverflow.com/a/10562258/239247):

import sys
if not sys.argv[1:]:
  sys.exit('usage: %s <filename>' % sys.argv[0])

with open(sys.argv[1],"rb") as f:
  d = f.read()
  crlf, lfcr = d.count('\r\n'), d.count('\n\r')
  cr, lf = d.count('\r'), d.count('\n')
  print('crlf: %s' % crlf)
  print('lfcr: %s' % lfcr)
  print('cr: %s' % cr)
  print('lf: %s' % lf)
  print('\ncr-crlf-lfcr: %s' % (cr - crlf - lfcr))
  print('lf-crlf-lfcr: %s' % (lf - crlf - lfcr))
  print('\ntotal (lf+cr-2*crlf-2*lfcr): %s\n' % (lf + cr - 2*crlf - 2*lfcr))

But it gives the wrong stats (for this file):

crlf: 1123
lfcr: 58
cr: 1123
lf: 1123

cr-crlf-lfcr: -58
lf-crlf-lfcr: -58

total (lf+cr-2*crlf-2*lfcr): -116

Like sorrat, I get 1123 crlf pairs for that file, with 0 for the 3 other EOL markers. — PM 2Ring, Apr 17 '15 at 11:30
@PM2Ring I need a better test file. I thought that this one actually contained mixed linefeeds. — anatoly techtonik, Apr 17 '15 at 12:55

sorrat · Accepted Answer · 2020-05-01T20:22:34.777

9

import sys


def calculate_line_endings(path):
    # Order matters: check the two-byte endings before the single-byte ones.
    endings = [
        b'\r\n',
        b'\n\r',
        b'\n',
        b'\r',
    ]
    counts = dict.fromkeys(endings, 0)

    with open(path, 'rb') as fp:
        for line in fp:
            for x in endings:
                if line.endswith(x):
                    counts[x] += 1
                    break
    print(counts)


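if __name__ == '__main__':
    # Show the usage message only when the argument is missing;
    # previously sys.exit() ran even after a successful count.
    if len(sys.argv) != 2:
        sys.exit('usage: %s <filepath>' % sys.argv[0])

    calculate_line_endings(sys.argv[1])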

For your file, this gives the following counts:

crlf: 1123
lfcr: 0
cr: 0
lf: 0

Is it enough?

edited May 01 '20 at 20:22

answered Apr 17 '15 at 11:17

sorrat

This one is good. Do you know how the `line in open(filename, "rb"):` detects the lines correctly? Just to know about corner cases. – anatoly techtonik Apr 17 '15 at 14:52
Sorry, I don't know. Maybe the reason is in [PEP-278](https://www.python.org/dev/peps/pep-0278/) – sorrat Apr 18 '15 at 17:38
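
On the corner-case question: in Python 3, iterating over a file opened in binary mode splits lines on b'\n' only. So a line can end in b'\r\n' or b'\n', but never in b'\n\r', and a lone b'\r' is only seen as an ending if it is the very last byte of the data. A small sketch, using io.BytesIO as a stand-in for a real file (the sample bytes are made up for illustration):

```python
import io

# Made-up sample: a CRLF ending, an LFCR ending, a lone CR, and a final LF.
sample = b"one\r\ntwo\n\rthree\rfour\n"

# Binary-mode iteration cuts after every b"\n" and nowhere else.
lines = list(io.BytesIO(sample))
print(lines)
# [b'one\r\n', b'two\n', b'\rthree\rfour\n']
```

With the endswith() approach, the LFCR here would be counted as a plain LF and the lone CR would not be counted at all, so it fits files that use CRLF or LF but would probably miss the other two markers.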

PM 2Ring · Answer 2 · 2015-04-17T11:35:36.567

The posted code doesn't work properly because Counter is counting characters in the file; it doesn't look for character pairs like \r\n and \n\r.

Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single char EOL markers.

For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.

#!/usr/bin/env python

''' Find and count various line ending character combinations

    From http://stackoverflow.com/q/29695861/4014959

    Written by PM 2Ring 2015.04.17
'''

import random
import re
from itertools import groupby

random.seed(42)

#Make a random text string containing various EOL combinations
tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
datasize = 300
data = ''.join([random.choice(tokens) for _ in range(datasize)])
print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)

output

'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r' 

['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r'] 

[(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]

Here's a version that reads the data from a named file, following the pattern of the code in the question.

import re
from itertools import groupby
import sys

if not sys.argv[1:]:
    exit('usage: %s <filename>' % sys.argv[0])

with open(sys.argv[1], 'rb') as f:
    data = f.read()

print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)
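
For Python 3, roughly the same idea might look like the sketch below. It is not the code from this answer: it uses a bytes pattern and collections.Counter instead of groupby, and the helper names count_line_endings and dominant_line_ending are made up for illustration.

```python
import re
from collections import Counter

# The two-character pairs come first so they win over the single characters.
EOL_PATTERN = re.compile(rb"\r\n|\n\r|\r|\n")


def count_line_endings(data):
    """Count each EOL marker found in a bytes object (hypothetical helper)."""
    return Counter(EOL_PATTERN.findall(data))


def dominant_line_ending(data):
    """Return the most frequent EOL marker, or None if there are none."""
    counts = count_line_endings(data)
    return counts.most_common(1)[0][0] if counts else None


# Made-up sample: two CRLF, one LF, one CR, one LFCR.
sample = b"a\r\nb\r\nc\nd\re\n\rf"
print(count_line_endings(sample))
print(dominant_line_ending(sample))  # b'\r\n'
```

To use it on a real file, read the contents with open(path, 'rb') and pass the resulting bytes to either function.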

Nice approach. Especially cool that it has test data to compare. — anatoly techtonik, Apr 17 '15 at 20:19

go2 · Answer 3 · 2015-04-17T11:27:25.733

From what I can see, I'd recommend checking how your code handles the following case: \r\n\r\n\r\n. Your code will count it as follows:

crlf: 3 -- [\r\n][\r\n][\r\n]
lfcr: 2 -- \r[\n\r][\n\r]\n
cr: 3   -- [\r]\n[\r]\n[\r]\n
lf: 3   -- \r[\n]\r[\n]\r[\n]

cr-crlf-lfcr: -2
lf-crlf-lfcr: -2

total (lf+cr-2*crlf-2*lfcr): -4

As you can see, some \n's and \r's are counted twice, once for crlf and once for lfcr. Instead, you can read the file line by line and check each line's ending with line.endswith(). To get exact statistics for cr and lf, you can then count each \r\n and \n\r as cr+1 and lf+1.
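
The double counting is easy to reproduce. A minimal sketch, written for Python 3 with bytes literals (the single-pass regex at the end is only there for contrast and is an addition, not part of the code in the question):

```python
import re

data = b"\r\n\r\n\r\n"

# Each count() call scans the data independently, so the same bytes
# are matched again by every pattern that happens to fit them.
crlf = data.count(b"\r\n")  # 3
lfcr = data.count(b"\n\r")  # 2, found inside neighbouring CRLF pairs
cr = data.count(b"\r")      # 3
lf = data.count(b"\n")      # 3
print(crlf, lfcr, cr, lf)

# A single left-to-right pass consumes each byte once: three CRLF, nothing else.
single_pass = re.findall(rb"\r\n|\n\r|\r|\n", data)
print(single_pass)
```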

score 1 · Answer 4 · answered Apr 17 '15 at 11:23

The best way to deal with line endings in git is to use git configuration. You can define exactly what must be done to line endings globally, for a particular repository, or for specific files. In a .gitattributes file, you can specify that certain files must be converted to your system's native line endings on each checkout and converted back on check-in. See GitHub line endings help for a detailed description.

answered Apr 17 '15 at 11:23

Mykhaylo Kopytonenko

I don't want to convert anything. Can git just leave my files as-is by default? – anatoly techtonik Apr 17 '15 at 12:51
