I am trying to split a text that uses a mix of newline characters: LF, CRLF and NEL. I need the best way to keep the NEL character from being treated as a line break.

Is there an option to tell readlines() to exclude NEL when splitting lines? I could read() the whole file and split only at LF and CRLF in a loop.

Is there any better solution?

I open the UTF-8 text file with codecs.open().

When I use readlines(), it does split at NEL characters:

session screenshot

The file contents are:

"u'Line 1 \\x85 Line 1.1\\r\\nLine 2\\r\\nLine 3\\r\\n'"
Martijn Pieters
Gaudha

2 Answers

file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.

U+0085 NEXT LINE (NEL) is not recognised as a newline splitter in that context, and you don't need to do anything special to have file.readlines() ignore it.

Quoting the open() function documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

and the universal newlines glossary entry:

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.

Unfortunately, codecs.open() breaks this rule; its documentation only vaguely alludes to line splitting being left to the specific codec:

Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.

Instead of codecs.open(), use io.open() to open the file in the correct encoding, then process the lines one by one:

import io

with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()

io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:

>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
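
For readers on Python 3, where the built-in open() is the same function as io.open(), the check can be sketched roughly as follows. The temporary file and the helper name read_lines are assumptions made for illustration and are not part of the original session. Passing newline='' is an optional variation that keeps the line endings as written while still leaving NEL alone.

```python
import io
import os
import tempfile

SAMPLE = 'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'


def read_lines(path, newline=None):
    """Read all lines with io.open(); hypothetical helper for this sketch."""
    with io.open(path, encoding='utf8', newline=newline) as f:
        return f.readlines()


# Write the sample bytes to a throwaway file.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(SAMPLE.encode('utf8'))

try:
    # Default (newline=None): \r\n is translated to \n on reading.
    translated = read_lines(path)
    # newline='': lines still end at \n, \r or \r\n, but are not translated.
    untranslated = read_lines(path, newline='')
finally:
    os.remove(path)

print(translated)
print(untranslated)
```

In both cases the first line should still contain the \x85 character, since io only treats \n, \r and \r\n as line endings.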

The codecs.open() result comes from its use of str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
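
The splitting difference can be seen on a plain string, without any file involved. The sketch below assumes Python 3 (where every str is a unicode string); split_lf_crlf is a hypothetical helper that follows the "read() and split only on LF and CRLF" idea from the question, and is only one possible way to do it.

```python
import re

text = 'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

# splitlines() on a unicode string treats U+0085 (NEL) as a line boundary.
by_splitlines = text.splitlines()


def split_lf_crlf(s):
    """Split only on CRLF or LF; NEL and a lone CR stay inside the line."""
    parts = re.split('\r\n|\n', s)
    # A trailing newline leaves one empty final element; drop it.
    if parts and parts[-1] == '':
        parts.pop()
    return parts


by_regex = split_lf_crlf(text)

print(by_splitlines)
print(by_regex)
```

Here splitlines() yields four pieces because it breaks at \x85, while the regex-based split yields three lines with the NEL character preserved in the first.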

Martijn Pieters
  • I honour your answer, which is informative. But the problem is still not solved. I do not want to remove NEL characters; I just want to exclude them when splitting lines. – Gaudha Jan 06 '15 at 22:04
  • @Gaudha: they are **not split on**. They don't need ignoring. If they are, you don't have NEL characters. Can you show us a `repr()` representation of your data? – Martijn Pieters Jan 06 '15 at 22:06
  • @Gaudha: in other words, please supply (in your question) a [MCVE](http://stackoverflow.com/help/mcve) by which you demonstrate the problem and explain the expected outcome instead. – Martijn Pieters Jan 06 '15 at 22:14
  • `"u'Line 1 \\x85 Line 1.1\\r\\nLine 2\\r\\nLine 3\\r\\n'"` – Gaudha Jan 06 '15 at 23:20
  • @Gaudha right, it appears that the file object produced by `codecs.open()` does split on NEL. You cannot configure that. Does `io.open()` do the same? If so, use regular `open()` and decode each line (`[line.decode('utf8') for line in open(filename)]`). I'll look into more options tomorrow. – Martijn Pieters Jan 06 '15 at 23:52
  • @Gaudha: I now had time to suss this out properly. `codecs.open()` indeed splits on U+0085, and that behaviour is at best *very poorly* documented. Use `io.open()` instead, and I've linked you to the bug reports that are at the heart of this difference. – Martijn Pieters Jan 07 '15 at 09:43
0
import re

f = [re.sub(' \\r ', '', str(line)) for line in open('file.csv', 'rb')]

This creates a list of strings with the additional \r characters removed. Each element in the list is a line from the file. I had a similar issue and this worked on my CSV. You may need to change the regular expression passed to re.sub to fit your needs.

NOTE: This removes the matched \r characters by replacing them with ''. I wanted to get rid of them, so it worked for me.
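
A rough Python 3 variant of the same idea, for comparison: read the file in binary mode, decode each line, and drop every carriage return. The helper name lines_without_cr and the demo file contents are assumptions made for this sketch. Unlike the pattern above, it removes all \r characters, including the one in each \r\n ending.

```python
import os
import tempfile


def lines_without_cr(path, encoding='utf8'):
    """Return the decoded lines of a file with every '\\r' removed."""
    with open(path, 'rb') as f:
        # Iterating a binary file splits on b'\n' only, so a stray
        # b'\r' stays inside its line until it is removed here.
        return [raw.decode(encoding).replace('\r', '') for raw in f]


# Demo on a throwaway file; the contents are made up for illustration.
fd, demo_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a,b\r\nc\rd,e\r\n')

try:
    cleaned = lines_without_cr(demo_path)
finally:
    os.remove(demo_path)

print(cleaned)
```

Because the lines are split on the byte b'\n' before decoding, a NEL character in the data would not cause a split here either.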

sergej