
What's the best (cleanest & fastest) way to replace all "universal newlines" with b'\n' in a Python 3 bytes object?


EDIT: I ended up using b'\n'.join(bytestr.splitlines()) because it seemed the safest, and I don't mind dropping the one potential newline at the end.

But see the excellent answer by @norok2 below for caveats, timings and a faster solution.

user124114

4 Answers


A bit late to the party, but let's see what I can contribute.

First, a disclosure: my favorite is @JohnHennig's double-.replace() method, because it is reasonably fast and it is crystal clear what is happening.

I think there are no other simple and fast solutions within standard Python beyond what was already proposed in the other answers (some of which I slightly modified to obtain the exact same result as the double-.replace()).

However, it may be possible to speed things up. Here I propose four additional solutions: two using Cython and two using Numba.

For simplicity, I wrote this with IPython using the Cython magic.

%load_ext Cython

The core idea is that it is sufficient to loop through the input only once, as long as we copy the data to another string as we go.

Coding this straight away in Python is simple, but to make it feasible we need to use bytearray() to overcome the immutability of str / bytes. The slow loop can be compiled for speed with Cython (unl_loop_cy()).

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


def unl_loop_cy(b):
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    result = bytearray(n)  # pre-allocate: the output is never longer than the input
    i = j = 0
    while i < n:
        if b[i] == nl_cr:
            # normalize both b'\r\n' and a lone b'\r' to b'\n'
            result[j] = nl_lf
            # the `i + 1 < n` check guards against a trailing b'\r'
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result[j] = b[i]
            i += 1
        j += 1
    return bytes(result[:j])

However, neither bytes nor bytearray is supported by Numba. To use it, we need to go through NumPy, which offers efficient means of dealing with the raw bytes: np.frombuffer() and np.ndarray.tobytes(). The base algorithm stays the same, and the code now reads:

import numpy as np
import numba as nb


@nb.jit
def _unl_loop_nb(b, result):
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    i = j = 0
    while i < n:
        if b[i] == nl_cr:
            # normalize both b'\r\n' and a lone b'\r' to b'\n'
            result[j] = nl_lf
            # the `i + 1 < n` check guards against a trailing b'\r'
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result[j] = b[i]
            i += 1
        j += 1
    return j


def unl_loop_nb(b):
    arr = np.frombuffer(b, np.uint8)
    result = np.empty(arr.shape, np.uint8)
    size = _unl_loop_nb(arr, result)
    return result[:size].tobytes()

With newer versions of Numba adding support for bytes and np.empty(), one can write an improved version of the above:

import numpy as np
import numba as nb


@nb.jit
def _unl_loop_nb2(b):
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    result = np.empty(n, dtype=np.uint8)
    i = j = 0
    while i < n:
        if b[i] == nl_cr:
            # normalize both b'\r\n' and a lone b'\r' to b'\n'
            result[j] = nl_lf
            # the `i + 1 < n` check guards against a trailing b'\r'
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result[j] = b[i]
            i += 1
        j += 1
    return result[:j]


def unl_loop_nb2(b):
    return _unl_loop_nb2(b).tobytes()
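
Note that @nb.jit compiles on the first call for each new input type, so when timing it is worth triggering the compilation beforehand; a one-line warm-up sketch:

unl_loop_nb2(b'warm-up\r\n')  # the first call pays the JIT compilation cost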

Finally, we can optimize the Cython solution further for additional speed. To do this, we replace the bytearray with an actual C++ string and push as much of the computation as possible "outside of Python".

%%cython --cplus -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


from libcpp.string cimport string


cdef extern from *:
    """
    #include <string>
    std::string & erase(
            std::string & s,
            std::size_t pos,
            std::size_t len) {
        return s.erase(pos, len); }
    """
    string& erase(string& s, size_t pos, size_t len)


cpdef string _unl_cppstr_cy(string s):
    cdef char nl_lf = b'\n'
    cdef char nl_cr = b'\r'
    cdef char null = b'\0'
    cdef size_t s_size = s.size()
    # pre-allocate the output: it is never longer than the input
    cdef string result = string(s_size, null)
    cdef size_t i = 0
    cdef size_t j = 0
    while i < s_size:
        if s[i] == nl_cr:
            result[j] = nl_lf
            # reading s[i + 1] at i + 1 == s_size is safe: std::string
            # guarantees a null terminator there (C++11 and later)
            if s[i + 1] == nl_lf:
                i += 1
        else:
            result[j] = s[i]
        j += 1
        i += 1
    # trim the unused tail of the pre-allocated buffer
    return erase(result, j, i - j)


def unl_cppstr_cy(b):
    return _unl_cppstr_cy(b)

The C++-optimized solution and the Numba-accelerated approaches come out with quite competitive timings, and they outperform the double-.replace() method (for sufficiently large inputs in the case of Numba). For somewhat smaller inputs the C++-optimized approach is the fastest, but for large enough inputs the Numba-based approaches (and particularly the second one) get even faster. The Cython-accelerated bytearray approach comes out as the slowest in the considered benchmarks, yet it remains notably competitive with the other solutions despite the explicit looping (thanks to the Cython compilation).

The benchmarks come out as follows:

[benchmark plot bm_full: timings of all methods across input sizes]

and, zooming in on the fastest methods:

[benchmark plot bm_zoom: close-up of the fastest methods]
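
As a quick sanity check, the accelerated variants can be verified against the double-.replace() reference (a small sketch; the tuple of functions is assumed from the definitions above):

sample = b'mixed\rnewlines\r\nin one\nstring\r'
reference = sample.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
# the accelerated implementations defined in this answer
for func in (unl_loop_cy, unl_loop_nb, unl_loop_nb2, unl_cppstr_cy):
    assert func(sample) == reference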


For completeness, here are the other tested functions:

def unl_replace(s):
    return s.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


# EDIT: was originally the commented code, but it is less efficient
# def unl_join(s):
#     nls = b'\r\n', b'\r', b'\n'
#     return b'\n'.join(s.splitlines()) + (
#         b'\n' if any(s.endswith(nl) for nl in nls) else b'')
def unl_join(s):
    result = b'\n'.join(s.splitlines())
    nls = b'\r\n', b'\r', b'\n'
    if any(s.endswith(nl) for nl in nls):
        result += b'\n'
    return result


# Following @VPfB's suggestion
def unl_join_new(s):
    return b'\n'.join((s + b'\0').splitlines())[:-1]


import re


def unl_re(s, match=re.compile(b'\r\n?')):
    return match.sub(b'\n', s)


def unl_join_naive(s):  # NOTE: not the same result as `unl_replace()`
    return b'\n'.join(s.splitlines())

and this is the function used to generate the inputs:

import random


def gen_input(num, nl_factor=0.10):
    # generate `num` tokens, roughly `nl_factor` of which are newline sequences
    nls = b'\r\n', b'\r', b'\n'
    words = (b'a', b'b', b' ')
    random.seed(0)
    nl_percent = int(100 * nl_factor)
    base = words * (100 - nl_percent) + nls * nl_percent
    return b''.join([base[random.randint(0, len(base) - 1)] for _ in range(num)])

and the scripts for generating the data and the plots (originally based on this) are available here.
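
For readers who want to time the candidates without the full scripts, a minimal harness might look like the sketch below (the function list and input sizes are illustrative, not the exact benchmark code):

import timeit


def benchmark(funcs, sizes=(100, 10_000, 1_000_000)):
    for num in sizes:
        s = gen_input(num)
        for func in funcs:
            # best of 3 repeats, each averaged over 10 calls
            t = min(timeit.repeat(lambda: func(s), number=10, repeat=3)) / 10
            print(f'{func.__name__:>16}  num={num:<9}  {t * 1e6:10.2f} µs')


benchmark([unl_replace, unl_join, unl_join_new, unl_re, unl_join_naive])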


Note

I have also tested a couple of other possible implementations with explicit looping, but I have omitted them from the comparisons because they were orders of magnitude slower than the proposed solutions (they remained slower even after compiling with Cython). I report them here for future reference:

def unl_loop(b):
    # the pure-Python version of unl_loop_cy() above
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    result = bytearray(n)
    i = j = 0
    while i < n:
        if b[i] == nl_cr:
            result[j] = nl_lf
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result[j] = b[i]
            i += 1
        j += 1
    return bytes(result[:j])


def unl_loop_add(b):
    # quadratic: bytes concatenation copies the whole result each time
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    result = b''
    i = 0
    while i < n:
        if b[i] == nl_cr:
            result += b'\n'
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result += b[i:i + 1]
            i += 1
    return result


def unl_loop_append(b):
    # grows the bytearray one byte at a time
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    n = len(b)
    result = bytearray()
    i = 0
    while i < n:
        if b[i] == nl_cr:
            result.append(nl_lf)
            i += 2 if i + 1 < n and b[i + 1] == nl_lf else 1
        else:
            result.append(b[i])
            i += 1
    return bytes(result)


def unl_loop_del(b):
    # deleting from a bytearray shifts the tail: O(n) per b'\r\n' found
    nl_cr = b'\r'[0]
    nl_lf = b'\n'[0]
    b = bytearray(b)
    i = 0
    while i < len(b):
        if b[i] == nl_cr:
            if i + 1 < len(b) and b[i + 1] == nl_lf:
                del b[i]
            else:
                b[i] = nl_lf
        i += 1
    return bytes(b)

(EDIT: comments on assumptions / potential issues)

Assumptions / Potential Issues

For "mixed newlines" files, e.g. b'alpha\nbravo\r\ncharlie\rdelta', there will always be a theoretical ambiguity of whether \r\n is to be considered as 1 or 2 newlines. All the methods implemented above will have the same behavior and consider \r\n as a single newline.

Additionally, all these methods can be confused by spurious \r and/or \r\n bytes arising from multi-byte encodings. For example, taking from @JohnHennig's comments, the Malayalam letter ഊ (U+0D0A) encodes to b'\n\r' in UTF-16, so a spurious b'\r\n' can show up in the byte stream; bytes.splitlines() is NOT aware of this, and all the tested methods behave the same:

s = 'ഊ\n'.encode('utf-16')
print(s)
# b'\xff\xfe\n\r\n\x00'

print(s.splitlines())
# [b'\xff\xfe', b'', b'\x00']

for func in funcs:
    print(func(s))
# b'\xff\xfe\n\n\x00'
# b'\xff\xfe\n\n\x00'
# b'\xff\xfe\n\n\x00'
# b'\xff\xfe\n\n\x00'
# b'\xff\xfe\n\n\x00'
# b'\xff\xfe\n\n\x00'

Finally, unl_join_naive() relies only on Python's built-in line splitting, which means it is a bit less obvious what exactly will happen, but it may get better support for this kind of issue in the future. This method also drops the last newline if there is one at the end of the string, so some extra code (adding a typically small, constant offset to the timing) is required to compensate for that behavior. A few suggestions to solve the issue include (a short demo follows the list):

  • checking the last chars for the presence of newline markers at the end (not an issue now, given the current bytes.splitlines() implementation, but it could become one if a spurious \r\n happened to be the last bytes and bytes.splitlines() became sensitive to that), as in unl_join();
  • appending any non-newline 7-bit ASCII char (e.g. \0) to the original input and removing the last byte after the join() (which seems both safer and faster than the previous option), as in unl_join_new().
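
A quick sketch of the difference on an input with a trailing newline, using the functions defined above:

s = b'last line\r\n'
print(unl_join_naive(s))
# b'last line'     (the trailing newline is lost)
print(unl_join_new(s))
# b'last line\n'   (the trailing newline is preserved)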

(EDITED: to add a simpler Cython solution, a Numba-based solution, and update the timings.) (EDITED: to add another Numba-based solution, requiring bytes and np.empty() support, and update the timings.)

norok2
  • Cool! Can you add the plain `join` (without checking the final newline) into the plots, too? Just for comparison's sake. – user124114 Jun 29 '19 at 20:24
  • Sure, but as you can easily guess, it is just a small offset, which is only significant for very small inputs. – norok2 Jun 29 '19 at 21:01
  • I'm leaning toward accepting this as the bounty answer. You already covered the *performance* very well. Can you also expand on the *safety* = deviations from Python's built-in behaviour, corner cases, potential problems? So that the answer is really canonical and complete. Thanks! – user124114 Jul 02 '19 at 13:00
  • @user124114 I have added @VPfB's approach for correcting `unl_join()` as well as comments on the issues with exotic encodings. – norok2 Jul 02 '19 at 20:37

Here is what I have used in the past:

>>> bytestr = b'A sentence\rextending over\r\nmultiple lines.\n'
>>> bytestr.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
b'A sentence\nextending over\nmultiple lines.\n'

I don't know if it's the best way, but it is straightforward and easy to reason about. For example, it's key to replace the two-byte sequence first and the remaining isolated \r characters second.
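
To see why the order matters, here is a quick sketch of the reversed order, which turns each \r\n into two newlines:

>>> bytestr.replace(b'\r', b'\n').replace(b'\r\n', b'\n')  # wrong order
b'A sentence\nextending over\n\nmultiple lines.\n'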

Even though the above example mixes different newline byte sequences, there is an implicit assumption that the approach is only used on input that follows a single newline convention throughout; it is simply agnostic as to which convention that is. Case in point: b'\r\r\n\n' has no unique interpretation if newlines were allowed to be mixed, as it may then represent either 3 or 4 empty lines.
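
As a quick sketch, the double-.replace() indeed counts 3 newlines there, while a splitlines()-based join would additionally drop the trailing one:

>>> b'\r\r\n\n'.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
b'\n\n\n'
>>> b'\n'.join(b'\r\r\n\n'.splitlines())
b'\n\n'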

john-hen
  • Looks reasonable, but does this cover all the edge cases? Isn't `b'\n'.join(bytes.splitlines())` safer? – user124114 Jun 22 '19 at 12:47
  • Actually, `splitlines()` will drop the final newline at the end (if any), so your solution here may be preferable (depending on whether you want to keep the final newline or not). – user124114 Jun 22 '19 at 13:29
  • My two pence here is that double-replace is the most unambiguous, clear and fast way of approaching the problem. I mean, could you tell off the top of your head what `splitlines/join` would do to `\r\r\n\n`? Beyond that, I would not expect any new encoding for newlines to show up any time soon (if ever -- we have already learned the hard way how painful the current situation is, the ASCII standard does not provide much room either, and the only other option left, `\n\r`, would make mixed-newline files too ambiguous anyway), so I do not see much room for future-proofing this. – norok2 Jul 01 '19 at 15:26

Regular expressions can also work with bytes objects. What about:

import re

data = b"hello\r\n world\r\naaaaa\rbbbbbb"

print(re.sub(b"\r\n?", b"\n", data))

result:

b'hello\n world\naaaaa\nbbbbbb'

The regular expression looks for \r optionally followed by \n and replaces it with \n. As you can see, it covers all cases, and it needs only one pass. From my benchmarks, though, a mere double bytes.replace() as in John's answer is much faster.
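
When the substitution runs many times, precompiling the pattern skips the re module's internal cache lookup on each call; a minimal sketch (the names here are illustrative):

import re

# compile once, reuse across calls
_CR_OR_CRLF = re.compile(b"\r\n?")


def normalize_newlines(data):
    return _CR_OR_CRLF.sub(b"\n", data)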

Jean-François Fabre

b'\n'.join(bytestr.splitlines())

The bytes.splitlines() built-in seems a bit safer and faster than multiple bytes.replace() calls:

bytestr = b'A sentence\rextending over\r\nmultiple lines.\n'

timeit b'\n'.join(bytestr.splitlines())
385 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

timeit bytestr.replace(b'\r\n', b'\n').replace(b'\r', b'\n')
457 ns ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It has the added advantage of being more future-proof, in case the "universal newline" behaviour changes again in future versions of Python.

It drops the final newline at the end (if any), though.
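
A quick sketch of that caveat:

b'\n'.join(b'one\r\ntwo\n'.splitlines())
# b'one\ntwo' -- the trailing newline is gone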

user124114
  • Regarding the final newline: append a dummy line to the input and then remove it from the output. One byte (other than \r or \n; anything 7-bit ASCII printable is a safe bet) does the job and should be fine everywhere `bytes.splitlines` is appropriate. (I think UTF-16 and UTF-32 are such exceptions.) – VPfB Jun 28 '19 at 13:27
  • @VPfB clever! Can you post that as a solution, including timings for comparison? Any way to manage UTF-16/32 as well? – user124114 Jun 28 '19 at 13:44
  • @user124114 Feel free to use that idea in your answer. I'm leaving my computer and will not have enough time. – VPfB Jun 28 '19 at 13:56