
I have files that sometimes contain weird end-of-line characters such as \r\r\n. Reading them like this works the way I want:

with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'         
# b'def'

I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:

print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']

Even with keepends=True, it's not the same result.
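
For reference, a minimal sketch of what keepends=True returns for the same bytes, next to the result wanted from the file loop (the variable names are only for illustration):

```python
data = b'abc\r\r\ndef'

# splitlines() treats each \r, \n and \r\n as its own line boundary
with_ends = data.splitlines(keepends=True)
print(with_ends)
# [b'abc\r', b'\r\n', b'def']

# what iterating over the binary file gives instead
wanted = [b'abc\r\r\n', b'def']
print(with_ends == wanted)
# False
```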

Question: how can I get the same behaviour as for l in f with splitlines()?

Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232

Note: I don't want to put everything in a BytesIO or StringIO, because that halves the speed (x0.5, already benchmarked); I want to keep a simple string. So it's not a duplicate of How do I wrap a string in a file in Python?.

Basj
  • Are you looking for `splitlines()` behaviour with `for l in f`? The former splits lines in many more cases, while the latter splits only on `\n`. – Dinari Jan 17 '21 at 20:09
  • Does this answer your question? [How do I wrap a string in a file in Python?](https://stackoverflow.com/questions/141449/how-do-i-wrap-a-string-in-a-file-in-python) – mkrieger1 Jan 17 '21 at 20:09
  • No @mkrieger1, I don't want to put everything in a BytesIO or StringIO, because that halves the speed (x0.5, already benchmarked). I want to keep a simple string. – Basj Jan 17 '21 at 20:11
  • I should have mentioned it, it's now fixed in the edit. – Basj Jan 17 '21 at 20:14
  • If you are reading the strings from files anyway, why don't you use the `for l in f` interface, then? – mkrieger1 Jan 17 '21 at 20:14
  • For a complex reason that is a bit off topic here, but to be precise, because of this: https://stackoverflow.com/questions/65763959/speed-up-reading-in-a-compressed-bz2-file-rb-mode – Basj Jan 17 '21 at 20:15

4 Answers


There are a couple of ways to do this, but none are especially fast.

If you want to keep the line endings, you might try the re module:

import re

lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently, with a precompiled pattern
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)

If you need the endings and the file is really big, you may want to iterate instead:

for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
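
A hedged sketch of how this pattern behaves on the bytes from the question. Two assumptions here: the input is bytes, so the pattern is written as a bytes literal (rb'...'), and the second sample input is made up for illustration. Note that the pattern groups every run of \r and \n together, so an empty line stays attached to the previous line, which differs from what for l in f gives:

```python
import re

# bytes input needs a bytes pattern
line_split_regex = re.compile(rb'[\r\n]+|[^\r\n]+[\r\n]*')

question_case = line_split_regex.findall(b'abc\r\r\ndef')
print(question_case)
# [b'abc\r\r\n', b'def']

# an empty line is merged into the preceding line's ending
blank_line_case = line_split_regex.findall(b'a\n\nb')
print(blank_line_case)
# [b'a\n\n', b'b']
```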

If you don't need the endings, then you can do it much more easily:

lines = list(filter(None, text.splitlines()))

You can omit the list() part if you just iterate over the results (or if using Python2):

for line in filter(None, text.splitlines()):
    pass # do stuff with line
Pi Marillion

Why don't you just split it:

input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n') 
print(result)

[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']

You will lose the trailing \n, which can be added back to every line afterwards if you really need it. The last line needs a check, because it only gets a \n if the input ended with one. For example:

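fixed = [bstr + b'\n' for bstr in result]
# input[-1] is an int in Python 3, so test the ending with endswith()
if input.endswith(b'\n'):
    fixed.pop()  # split() left an empty piece after the final newline
else:
    fixed[-1] = fixed[-1][:-1]  # the last line had no newline to restore
print(fixed)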

[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']

Another variant uses a generator. This way it is memory-efficient on huge files, and the syntax is similar to the original: for l in bin_split(input):

def bin_split(input_str):
    start = 0
    while start>=0 :
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start : found]
            start = found
        else:
            yield input_str[start:]
            break
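
One way to check this kind of splitting is to compare it against iterating over an io.BytesIO, which splits binary data on b'\n' only. The sketch below uses a hypothetical helper, iter_lines, written for this comparison (it is a variation on the generator above that yields nothing for empty input), and the sample inputs are made up:

```python
import io

def iter_lines(data):
    # Hypothetical helper: yield chunks that end right after each newline
    # byte, the same way iterating over a binary file does.
    start = 0
    while start < len(data):
        end = data.find(b'\n', start)
        if end == -1:
            # no newline left: the rest is the final, unterminated line
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1

samples = [b'abc\r\r\ndef', b'\nabc\r\r\r\nd\ref\nghi\r\njkl', b'abc\n', b'']
for sample in samples:
    # BytesIO is used here only as a reference, not in the fast path
    assert list(iter_lines(sample)) == list(io.BytesIO(sample))

print(list(iter_lines(b'abc\r\r\ndef')))
# [b'abc\r\r\n', b'def']
```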
igrinis
  • Sometimes the simplest solution is the best. How couldn't I think of this! It seems to work perfectly, even with weird end-of-lines like `\r\r\r\n`, as well as `\r\n` or `\n`. Thanks! – Basj Jan 28 '21 at 20:34

I would iterate through like this:

text = 'abc\r\r\ndef'  # a plain str, not the repr of a bytes object

results = text.split('\r\r\n')

for r in results:
    print(r)
Matt Cottrill
  • Thanks, but I'm looking for a general solution, I don't want to hardcode `\r\r\n`. Each file I'm processing from different sources might use different end of lines. – Basj Jan 17 '21 at 20:06
  • How are you identifying the delimiter then? This should work with any delimiter – Matt Cottrill Jan 17 '21 at 20:18
  • Some files have weird delimiters, some files might use some delimiters in the beginning then others. I don't have control on the source data I'm processing. – Basj Jan 17 '21 at 20:19

This is a for l in f: solution:

The key to this is the newline argument on the open call. From the documentation:

[Image: excerpt from the Python documentation describing the newline argument of open()]

Therefore, you should use newline='' when writing to suppress newline translation and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:

with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')
with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))

Prints:

'abc\r\r\n'
'def'
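
To see how the newline argument changes what the loop yields, here is a small sketch that reads the same file three ways. The helper read_lines and the temporary file are assumptions made for this illustration only:

```python
import os
import tempfile

def read_lines(path, newline):
    # hypothetical helper: collect the lines seen by "for line in f"
    with open(path, 'r', newline=newline) as f:
        return [line for line in f]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')
    with open(path, 'w', newline='') as f:  # write the text untranslated
        f.write('abc\r\r\ndef')

    default_mode = read_lines(path, None)  # universal newlines, translated to '\n'
    untranslated = read_lines(path, '')    # universal newlines, endings kept
    only_lf = read_lines(path, '\n')       # only '\n' ends a line

print(default_mode)  # ['abc\n', '\n', 'def']
print(untranslated)  # ['abc\r', '\r\n', 'def']
print(only_lf)       # ['abc\r\r\n', 'def']
```

Only newline='\n' keeps the run of '\r' characters attached to the line it belongs to.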

A quasi-splitlines solution:

Strictly speaking, this is not a splitlines solution: to handle arbitrary line endings you would need a regular-expression split that captures the line endings and then re-assembles the lines with their endings. Instead, this solution uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:

import re

input = '\nabc\r\r\ndef\nghi\r\njkl'

with open('test.txt', 'w', newline='') as f:
    f.write(input)
with open('test.txt', 'r', newline='') as f:
    text = f.read()
    lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
    for line in lines:
        print(repr(line))

Prints:

'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
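
The capture-and-reassemble approach mentioned earlier could look roughly like the sketch below. The function name split_keepends is hypothetical, and the only line ending assumed is any number of '\r' characters followed by one '\n'. Unlike the findall pattern, a lone '\r' in the middle of a line stays part of that line:

```python
import re

def split_keepends(text):
    # Hypothetical helper. The capturing group makes re.split() return the
    # endings too: [content, ending, content, ending, ..., last content]
    parts = re.split(r'(\r*\n)', text)
    # glue each piece of content back onto the ending that followed it
    lines = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1]:
        # text after the final newline, if any, is an unterminated last line
        lines.append(parts[-1])
    return lines

print(split_keepends('\nabc\r\r\ndef\nghi\r\njkl'))
# ['\n', 'abc\r\r\n', 'def\n', 'ghi\r\n', 'jkl']
print(split_keepends('d\ref\n'))
# ['d\ref\n']
```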


Booboo