5

I have a simple problem. I have a number of text files where words have been divided (hyphenated) at the end of lines. Something like this:

toward an emotionless evalu-
ation of objectively gained

I would like to get rid of the hyphenation and join the words again. This could be done simply and quickly using the replace() function. However, in some cases there are a few extra line breaks after the hyphen, like this:

end up as a first rate con-


tribution, but that was not

Rather than pile up several calls to replace(), I've just switched to regular expressions and used re.sub(r'-\n+', '', text):

import re

def replace_hyphens(text):
    return re.sub(r'-\n+', '', text)
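
For example (my quick check, using the first sample above):

text = "toward an emotionless evalu-\nation of objectively gained"
print(replace_hyphens(text))  # toward an emotionless evaluation of objectively gained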

This works quite well but I was wondering how I could achieve the same result with a function coded directly in Python. This is what I came up with:

def join_hyphens(text):
    processed = ''
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed += text[i]
        i += 1
    return processed

But of course the performance is abysmal compared to regular expressions. If I time both functions over 100 iterations on a rather long string, here are the results:

join_hyphens done in 2.398ms
replace_hyphens done in 0.021ms

What would be the best way to improve performance while still using native Python code?


Edit: Switching to a list, as suggested, significantly improves performance but still fares poorly compared to regular expressions:

def join_hyphens(text):
    processed = []
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed.append(text[i])
        i += 1
    return ''.join(processed)

Gives:

    join_hyphens done in 1.769ms
    replace_hyphens done in 0.020ms
user2969402

6 Answers

5
processed += text[i]

is very slow once processed becomes big. Strings are immutable, so the in-place addition is just an illusion: it is not actually done in place.
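
You can watch this happen on CPython by checking the object identity (a quick illustration of mine, not part of the original answer):

s = 'abc'
before = id(s)
s += 'd'                # reads like an in-place update...
print(id(s) == before)  # usually False: a brand-new string object was built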

There are several alternatives, a simple one is to build a list then use str.join:

def join_hyphens(text):
    processed = []
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed.append(text[i])
        i += 1
    return "".join(processed)

str.join pre-computes the required size of the result, allocates it in one go, and copies the pieces in. All of this runs in the interpreter's compiled code, so it's very fast.

(Unfortunately, the character-by-character Python loop in your code is still the bottleneck; regular expressions do their scanning in compiled code with no Python-level loop, which explains why they are much faster. str.join is very useful in other contexts, but the current problem is solved faster by several other answers here.)
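
A small timing sketch of the two building strategies (the sizes and function names are mine, purely illustrative):

import timeit

def build_concat(chars):
    s = ''
    for c in chars:
        s += c  # may reallocate and copy the whole string each time
    return s    # (CPython can sometimes resize in place, but don't rely on it)

def build_join(chars):
    return ''.join(chars)  # one size computation, one allocation, one copy

chars = list('hyphenated text ') * 10_000
print('concat:', timeit.timeit(lambda: build_concat(chars), number=10))
print('join:  ', timeit.timeit(lambda: build_join(chars), number=10))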

Jean-François Fabre
5

A bit late to the party, but no matter... Everything in the Python standard library is considered native Python, as it is available on any Python installation, and that includes the re module.

However, if you insist on doing it in pure Python, then instead of iterating over the characters one by one you can use native text search to skip over large swaths of text at once. That should improve performance quite a bit and in certain cases even beat the regex. Of course, building the result with "".join() instead of repeated concatenation is also much preferable, as others have stated:

def join_hyphens(text):
    pieces = []  # a simple chunk buffer
    head = 0  # our current search head
    finder = text.find  # optimize lookup for str.find
    add_piece = pieces.append  # optimize lookup for list.append
    while True:
        index = finder("-\n", head)  # find the next hyphen + newline
        if index >= 0:  # check if a hyphen was found
            add_piece(text[head:index])  # add the current chunk
            head = index + 2  # move the search head past the find
            while head < len(text) and text[head] == "\n":  # skip new line characters
                head += 1  # (bounds check guards against text ending in "-\n")
        else:
            add_piece(text[head:])  # add the last chunk
            break
    return "".join(pieces)  # join the chunks and return them

And to test it:

text = """end up as a first rate con-


tribution, but that was not"""

print(join_hyphens(text))  # end up as a first rate contribution, but that was not
zwer
  • This is also very fast - see the [timing results](https://stackoverflow.com/a/49658864/5858851) I posted. – pault Apr 04 '18 at 19:48
  • @pault - thanks! I was about to do a test suite as well. Can you retest the above? I added a few micro-optimizations. – zwer Apr 04 '18 at 19:54
  • I updated the timing results - I had to update them for everyone since I overwrote my random data. Your optimizations seem to have worked! – pault Apr 04 '18 at 19:59
  • perfect. this is almost as fast as regular expressions. – user2969402 Apr 04 '18 at 21:20
4

Building a string with += makes it O(n**2). Making a list of pieces and joining them is O(n), and faster for any substantial text.

def join_hyphens(text):
    processed = []
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed.append(text[i])
        i += 1
    return ''.join(processed)

EDIT: Without a sample this is untested, but it is a standard idiom. EDIT2: Corrected a syntax error.

Terry Jan Reedy
3

Try:

def join_hyphens(text):
    while "-\n\n" in text:
        text = text.replace("-\n\n", "-\n")

    return text.replace("-\n", "")

This will still create multiple strings, but far fewer than your method: one copy per iteration of the while loop (the number of iterations equals the longest run of extra newlines after any hyphen), plus one final copy to remove all remaining -\n occurrences.
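
A quick sanity check on the sample from the question (my test, not part of the original answer):

text = "end up as a first rate con-\n\n\ntribution, but that was not"
print(join_hyphens(text))  # end up as a first rate contribution, but that was not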

Patrick Artner
3

Another option:

def join_hyphens(text):
    return "\n".join([t for t in text.split("\n") if t]).replace("-\n", "")

Split the text on \n and then use a list comprehension to remove the empty lines. Then join it back together using \n and do a replace.

This is fast, but it will have the side effect of removing all blank lines.
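
For instance (my example, not from the original answer), a blank line disappears even when no hyphen is involved:

text = "first paragraph\n\nsecond para-\n\ngraph"
print(join_hyphens(text))  # prints 'first paragraph\nsecond paragraph' (the blank line is gone)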


Update: Timing results

First build a random dataset:

import string
import numpy as np
p1 = 0.25
p2 = 0.25
NLines = 100
text = "\n".join(
    [
        " ".join(
            [
                "".join(
                    [
                        np.random.choice(list(string.ascii_letters))
                        for _ in range(np.random.randint(1,10))
                    ]
                ) 
                for _ in range(np.random.randint(1,10))
            ]
        )
        + ("-" if np.random.random() < p1 else "") 
        + "".join(["\n" for _ in range(np.random.randint(1,4)) if np.random.random() < p2])
        for _ in range(NLines)
    ]
) + "this is the last line"

Results:

%%timeit
replace_hyphens(text)
#100000 loops, best of 3: 8.1 µs per loop

%%timeit
join_hyphens(text)
#1000 loops, best of 3: 601 µs per loop

%%timeit
join_hyphens_pault(text)
#100000 loops, best of 3: 17.7 µs per loop

%%timeit
join_hyphens_terry(text)
#1000 loops, best of 3: 661 µs per loop

%%timeit
join_hyphens_jean(text)
#1000 loops, best of 3: 653 µs per loop

%%timeit
join_hyphens_patrick(text)
#100000 loops, best of 3: 10.1 µs per loop

%%timeit
join_hyphens_zwer(text)
#100000 loops, best of 3: 14.4 µs per loop
pault
1

I think part of the abysmal performance is that you keep creating new strings, because strings are immutable in Python. So when you do

processed += text[i]

a new string of size len(processed) + 1 is allocated. To be faster you want to avoid that repeated allocation, so you build a list of characters and mutate that instead. Ideally you would compute the space you need up front and preallocate the output list to avoid unnecessary allocations.
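
A minimal sketch of that preallocation idea (the function name and the exact skipping rule are mine, following the regex semantics of removing a hyphen only when newlines follow it):

def join_hyphens_prealloc(text):
    out = [None] * len(text)  # the output can never be longer than the input
    n = 0  # number of characters written so far
    i = 0
    while i < len(text):
        if text[i] == '-' and i + 1 < len(text) and text[i + 1] == '\n':
            i += 2  # skip the hyphen and the first newline
            while i < len(text) and text[i] == '\n':
                i += 1  # skip any extra blank lines
            continue
        out[n] = text[i]
        n += 1
        i += 1
    return ''.join(out[:n])

print(join_hyphens_prealloc("con-\n\n\ntribution"))  # contribution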

midor