I had a nasty CRLF / LF conflict in a git file that was probably committed from a Windows machine. Is there a cross-platform way (preferably in Python) to detect which type of newline is dominant in a file?

I've got this code (based on an idea from https://stackoverflow.com/a/10562258/239247):

import sys

if not sys.argv[1:]:
    sys.exit('usage: %s <filename>' % sys.argv[0])

# Read the file as bytes so that no newline translation takes place.
with open(sys.argv[1], 'rb') as f:
    d = f.read()

# Bytes literals are needed so that count() also works on Python 3.
crlf, lfcr = d.count(b'\r\n'), d.count(b'\n\r')
cr, lf = d.count(b'\r'), d.count(b'\n')
print('crlf: %s' % crlf)
print('lfcr: %s' % lfcr)
print('cr: %s' % cr)
print('lf: %s' % lf)
print('\ncr-crlf-lfcr: %s' % (cr - crlf - lfcr))
print('lf-crlf-lfcr: %s' % (lf - crlf - lfcr))
print('\ntotal (lf+cr-2*crlf-2*lfcr): %s\n' % (lf + cr - 2*crlf - 2*lfcr))

But it reports the wrong stats (for this file):

crlf: 1123
lfcr: 58
cr: 1123
lf: 1123

cr-crlf-lfcr: -58
lf-crlf-lfcr: -58

total (lf+cr-2*crlf-2*lfcr): -116

anatoly techtonik

4 Answers

import sys


def calculate_line_endings(path):
    # order matters!
    endings = [
        b'\r\n',
        b'\n\r',
        b'\n',
        b'\r',
    ]
    counts = dict.fromkeys(endings, 0)

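    with open(path, 'rb') as fp:
        # Iterating a binary file splits on b'\n' only, so a line ends in
        # b'\r\n' or b'\n'. b'\n\r' can never match here, and a bare b'\r'
        # can match only on the final line of the file.
        for line in fp:
            for x in endings:
                if line.endswith(x):
                    counts[x] += 1
                    break
    print(counts)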


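if __name__ == '__main__':
    # Show the usage message only when the argument is missing;
    # previously sys.exit() ran even after a successful count.
    if len(sys.argv) != 2:
        sys.exit('usage: %s <filepath>' % sys.argv[0])

    calculate_line_endings(sys.argv[1])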

This gives the following output for your file:

crlf: 1123
lfcr: 0
cr: 0
lf: 0

Is it enough?
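
Since the question asks which ending is dominant, the counting idea above can be taken one step further. The following is only a sketch, not part of the original answer: the function names `count_line_endings` and `dominant_line_ending` are made up for illustration, and it uses `bytes.splitlines(keepends=True)` instead of iterating over the file object. `splitlines()` breaks on `\r\n`, `\n` and `\r`, so an `\n\r` sequence is reported as one LF followed by one CR rather than as a pair.

```python
from collections import Counter


def count_line_endings(data: bytes) -> Counter:
    """Count the CRLF, LF and CR terminators found in data."""
    counts = Counter()
    # keepends=True keeps the terminator attached to each line.
    for line in data.splitlines(keepends=True):
        # Check the two-byte ending first so CRLF is not mistaken for LF.
        for ending in (b'\r\n', b'\n', b'\r'):
            if line.endswith(ending):
                counts[ending] += 1
                break
    return counts


def dominant_line_ending(data: bytes):
    """Return the most frequent terminator, or None if there is none."""
    counts = count_line_endings(data)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


sample = b'first\r\nsecond\r\nthird\nfourth\rlast'
print(dict(count_line_endings(sample)))
print(repr(dominant_line_ending(sample)))
```

To use it on a real file, read the file with `open(path, 'rb')` and pass the resulting bytes in; opening in text mode would translate the newlines before they can be counted.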

sorrat

The posted code doesn't work properly because Counter counts individual characters in the file; it doesn't look for character pairs like \r\n and \n\r.

Here's some Python 2.6 code that finds each occurrence of the 4 EOL markers \r\n, \n\r, \r and \n using a regex. The trick is to look for the \r\n and \n\r pairs before looking for the single char EOL markers.

For testing purposes it creates some random text data; I wrote this before I noticed your link to a test file.

#!/usr/bin/env python

''' Find and count various line ending character combinations

    From http://stackoverflow.com/q/29695861/4014959

    Written by PM 2Ring 2015.04.17
'''

import random
import re
from itertools import groupby

random.seed(42)

#Make a random text string containing various EOL combinations
tokens = list(2*'ABCDEFGHIJK ' + '\r\n') + ['\r\n', '\n\r']
datasize = 300
data = ''.join([random.choice(tokens) for _ in range(datasize)])
print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)

Output:

'FAHGIG\rC AGCAFGDGEKAKHJE\r\nJCC EKID\n\rKD F\rEHBGICGCHFKKFH\r\nGFEIEK\n\rFDH JGAIHF\r\n\rIG \nAHGDHE\n G\n\rCCBDFK BK\n\rC\n\r\rAIHDHFDAA\r\n\rHCF\n\rIFFEJDJCAJA\r\n\r IB\r\r\nCBBJJDBDH\r FDIFI\n\rGACDGJEGGBFG\n\rBGGFD\r\nDBJKFCA BIG\n\rC J\rGFA HG\nA\rDB\n\r \n\r\n EBF BK\n\rHJA \r\n\n\rDIEI\n\rEDIBEC E\r\nCFEGGD\rGEF EC\r\nFIG GIIJCA\n\r\n\rCFH\r\n\r\rKE HF\n\rGAKIG\r\nDDCDHEIFFHB\n C HAJFHID AC\r' 

['\r', '\r\n', '\n\r', '\r', '\r\n', '\n\r', '\r\n', '\r', '\n', '\n', '\n\r', '\n\r', '\n\r', '\r', '\r\n', '\r', '\n\r', '\r\n', '\r', '\r', '\r\n', '\r', '\n\r', '\n\r', '\r\n', '\n\r', '\r', '\n', '\r', '\n\r', '\n\r', '\n', '\n\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r\n', '\n\r', '\n\r', '\r\n', '\r', '\r', '\n\r', '\r\n', '\n', '\r'] 

[(17, '\n\r'), (14, '\r'), (12, '\r\n'), (5, '\n')]

Here's a version that reads the data from a named file, following the pattern of the code in the question.

import re
from itertools import groupby
import sys

if not sys.argv[1:]:
    sys.exit('usage: %s <filename>' % sys.argv[0])

with open(sys.argv[1], 'rb') as f:
    data = f.read()

print repr(data), '\n'

#regex to find various EOL combinations
pat = re.compile(r'\r\n|\n\r|\r|\n')

eols = pat.findall(data)
print eols, '\n'

grouped = [(len(list(group)), key) for key, group in groupby(sorted(eols))]
print sorted(grouped, reverse=True)
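
For readers on Python 3, the same regex idea might look roughly like the sketch below. It is an adaptation rather than the code above: it assumes the data is handled as bytes, swaps the `sorted`/`groupby` step for `collections.Counter`, and the names `EOL_PATTERN` and `count_eols` are invented here.

```python
import re
from collections import Counter

# The two-byte pairs are listed first so they win over the single-byte
# alternatives; regex alternation tries the options left to right.
EOL_PATTERN = re.compile(rb'\r\n|\n\r|\r|\n')


def count_eols(data: bytes) -> Counter:
    """Count each EOL marker; findall() returns non-overlapping matches."""
    return Counter(EOL_PATTERN.findall(data))


sample = b'one\r\ntwo\n\rthree\rfour\nfive\r\n'
for eol, n in count_eols(sample).most_common():
    print(n, repr(eol))
```

Because `findall()` consumes each byte at most once, a run such as `\r\n\r\n\r\n` is counted as three CRLF pairs and nothing else.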

PM 2Ring

From what I can see, I'd recommend checking whether your file contains a sequence like \r\n\r\n\r\n. Your code counts it as follows:

crlf: 3 -- [\r\n][\r\n][\r\n]
lfcr: 2 -- \r[\n\r][\n\r]\n
cr: 3   -- [\r]\n[\r]\n[\r]\n
lf: 3   -- \r[\n]\r[\n]\r[\n]

cr-crlf-lfcr: -2
lf-crlf-lfcr: -2

total (lf+cr-2*crlf-2*lfcr): -4

As you can see, some of the \n and \r characters are counted twice, once for crlf and once for lfcr. Instead, you can read the file line by line and check each line's ending with line.endswith(). If you then want exact totals for cr and lf, count every \r\n and \n\r as cr+1 and lf+1.
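
The double counting can be reproduced in a few lines. The sketch below is an illustration only: `scan_line_endings` is a hypothetical helper that walks the bytes once from left to right, which is a different technique from the line-by-line `endswith()` approach suggested above but gives each byte to exactly one marker.

```python
def scan_line_endings(data: bytes) -> dict:
    """Count EOL markers in a single pass, consuming each byte only once."""
    counts = {'crlf': 0, 'lfcr': 0, 'cr': 0, 'lf': 0}
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair == b'\r\n':
            counts['crlf'] += 1
            i += 2
        elif pair == b'\n\r':
            counts['lfcr'] += 1
            i += 2
        elif pair[:1] == b'\r':
            counts['cr'] += 1
            i += 1
        elif pair[:1] == b'\n':
            counts['lf'] += 1
            i += 1
        else:
            i += 1  # ordinary byte, not part of a line ending
    return counts


data = b'\r\n\r\n\r\n'

# Separate count() calls know nothing about each other, so the same
# byte is matched by several of them.
naive = {
    'crlf': data.count(b'\r\n'),
    'lfcr': data.count(b'\n\r'),
    'cr': data.count(b'\r'),
    'lf': data.count(b'\n'),
}
exact = scan_line_endings(data)

# Raw character totals: every pair contributes one \r and one \n.
total_cr = exact['cr'] + exact['crlf'] + exact['lfcr']
total_lf = exact['lf'] + exact['crlf'] + exact['lfcr']

print(naive)
print(exact)
print(total_cr, total_lf)
```

The naive dictionary matches the figures listed above (3, 2, 3, 3), while the single-pass scan reports three CRLF pairs and nothing else.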

go2

The best way to deal with line endings in git is to use git's own configuration. You can define exactly what should happen to line endings globally, for a particular repository, or for specific files. In a .gitattributes file, you can specify that certain files are converted to your system's native line endings on each checkout and converted back on commit. See the GitHub line endings help for a detailed description.
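
As a rough illustration, a .gitattributes file along these lines is a common starting point. This is a configuration fragment rather than Python, and the file patterns are assumptions chosen for the example, not taken from your repository; adjust them to your project.

```
# Let git detect text files and normalize their line endings.
* text=auto

# Hypothetical per-type overrides.
*.py  text eol=lf
*.bat text eol=crlf
*.png binary
```

With `text=auto`, git stores text files with LF in the repository and converts them on checkout according to the platform or the `eol` attribute, while files marked `binary` are left untouched.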