0

I'm working with some text that has a mix of languages, which I've already done some processing on and is in the form a list of single characters (called "letters"). I can tell which language each character is by simply testing if it has case or not (with a small function called "test_lang"). I then want to insert a space between characters of different types, so I don't have any words that are a mix of character types. At the same time, I want to insert a space between words and punctuation (which I defined in a list called "punc"). I wrote a script that does this in a very straight-forward way that made sense to me (below), but apparently is the wrong way to do it, because it is incredibly slow.

Can anyone tell me what the better way to do this is?

# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ""
i = 0
while i <= len(letters)-2: #range excludes last letter to avoid Out of Range error for i+1
    cleaned += letters[i]
    # words that have case are Latin; otherwise Arabic
    if test_lang(letters[i]) != test_lang(letters[i+1]):
        cleaned += " "
    if letters[i] in punc or letters[i+1] in punc:
        cleaned += " "
    i += 1
cleaned += letters[len(letters)-1] # add in last letter
larapsodia
  • 594
  • 4
  • 15

6 Answers6

4

There are a few things going on here:

  • You call test_lang() on every letter in the string twice, this is probably the main reason this is slow.
  • Concatenating strings in Python isn't very efficient, you should instead use a list or generator and then use str.join() (most likely, ''.join()).

Here is the approach I would take, using itertools.groupby():

from itertools import groupby
def keyfunc(letter):
    return (test_lang(letter), letter in punc)

cleaned = ' '.join(''.join(g) for k, g in groupby(letters, keyfunc))

This will group the letters into consecutive letters of the same language and whether or not they are punctuation, then ''.join(g) converts each group back into a string, then ' '.join() combines these strings adding a space between each string.

Also, as noted in comments by DSM, make sure that punc is a set.

Andrew Clark
  • 202,379
  • 35
  • 273
  • 306
  • I'm finding this to be only about 15% faster than the original solution. An improvement, no doubt, but it seems like we ought to be able to do much better. – Aaron Dufour Jan 25 '13 at 20:33
  • Take that back; testing shows that python is doing something clever to keep the original solution linear rather than the quadratic you'd expect. That said, you're missing the punctuation case so this isn't quite satisfactory. – Aaron Dufour Jan 25 '13 at 20:37
  • @AaronDufour - Punctuation case is handled. – Andrew Clark Jan 25 '13 at 20:38
  • 1
    @AaronDufour: String concatenation using `+` [has been optimized](http://stackoverflow.com/a/4435752/20670) since Python 2.5; strings are now extended in-place, if possible. Schlemiel the painter is no longer a problem :) – Tim Pietzcker Jan 25 '13 at 20:47
  • Clever! I made an edit to capture the consecutive-punctuation case. Please let me know if this doesn't seem right for any reason. – Aaron Dufour Jan 25 '13 at 20:47
  • @TimPietzcker Thanks for the source! My tests indicated as such, but I hadn't found a reason why it was coming out that way. – Aaron Dufour Jan 25 '13 at 20:49
  • I think I may have been misinterpreting the original code, I was thinking that you would not want to insert spaces between consecutive punctuation but it looks like that is what OP wants to do. – Andrew Clark Jan 25 '13 at 20:54
  • Whether the punctuation is together or separated doesn't matter too much to me, as long as it's separated from the text. @F.J. - I can't get this code to work in order to test it. When I try in in the shell, I keep getting this error: `UnboundLocalError: local variable 'counter' referenced before assignment` Obviously, I defined `counter=0` just as you did above, so I don't really understand what's going on here ... ? – larapsodia Jan 25 '13 at 22:35
  • @larapsodia - Sorry, didn't pay enough attention to the edits, try now. This will not insert the space between consecutive whitespace. – Andrew Clark Jan 25 '13 at 23:01
  • As for why you are seeing that error, see [this answer](http://stackoverflow.com/a/146365/505154) to another question. It has to do with how Python treats assignment within a function, you can only assign to a global function if you use the `global` keyword. – Andrew Clark Jan 25 '13 at 23:04
  • Doy! I knew that... Thanks. – larapsodia Jan 26 '13 at 00:20
  • Just tried it out, integrated into my whole preprocessing script, and it works beautifully! It only took a couple of minutes to process the whole corpus, as opposed to over half an hour before. So this was definitely the bottleneck, thank you @FJ! – larapsodia Jan 26 '13 at 01:23
  • @larapsodia Based on the speed-up you saw, I'm guessing you're using Python < 2.5? It does sound like you were seeing Schlemiel behavior. – Aaron Dufour Jan 27 '13 at 20:08
2

Every time you perform a string concatenation, a new string is created. The longer the string gets, the longer each concatenation takes.

http://en.wikipedia.org/wiki/Schlemiel_the_Painter's_algorithm

You might be better off declaring a list big enough to store the characters of the output, and joining them at the end.

recursive
  • 83,943
  • 34
  • 151
  • 241
  • 2
    "Declaring a list big enough..." is not something you do in Python. You let the list grow dynamically. – Tim Pietzcker Jan 25 '13 at 20:20
  • Your diagnosis is correct, but without a usable suggestion for fixing it this answer isn't terribly helpful. – Aaron Dufour Jan 25 '13 at 20:31
  • 1
    Actually, some simple testing shows that the original solution scales linearly with the size of `letters`, whereas a schlemiel's algorithm would scale quadratically. It seems that python is doing something clever here. – Aaron Dufour Jan 25 '13 at 20:36
  • 1
    CPython optimizes the string += operation for this common case. It knows the string is only referenced once, and so can modify the string in place instead of creating a new string. – Ned Batchelder Jan 25 '13 at 21:04
  • @AaronDufour: I'm not sure why you think that my proposed solution is not usable. – recursive Jan 25 '13 at 21:44
  • @recursive See Tim Pietzcker's comment. – Aaron Dufour Jan 25 '13 at 21:46
  • @NedBatchelder This is true of Python 2.5 and up as well. This is not a Schlemiel problem after all, unless OP is on an old version. – Aaron Dufour Jan 25 '13 at 21:47
  • @AaronDufour: I'm not sure I understand the comment. Here's how I'd declare a list size: `chars = [None] * capacity` – recursive Jan 25 '13 at 22:06
  • 1
    @recursive Since you can append to a list in amortized O(1), that doesn't really help. – Aaron Dufour Jan 26 '13 at 00:11
1

I suggest an entirely different solution that should be very fast:

import re
cleaned = re.sub(r"(?<!\s)\b(?!\s)", " ", letters, flags=re.LOCALE)

This inserts a space at every word boundary (defining words as "sequences of alphanumeric characters, including accented characters in your current locale", which should work in most cases), unless it's a word boundary next to whitespace.

This should split between Latin and Arabic characters as well as between Latin and punctuation.

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • Not sure this handles the mixed language issue. – Andrew Clark Jan 25 '13 at 20:28
  • Yeah, this logic won't be even remotely similar to what OP is looking for. – Aaron Dufour Jan 25 '13 at 20:29
  • @AaronDufour: I'm not sure either; the comment about "Latin, otherwise Arabic" gave me the impression that that's what he's trying to separate. I may be wrong about that. – Tim Pietzcker Jan 25 '13 at 20:31
  • @Tim - Yes, you're right, separating Arabic and Latin characters is what _she_ is trying to do. :-) So I don't believe this solution will work. – larapsodia Jan 25 '13 at 20:45
  • @larapsodia: Did you give it a try? That's exactly the situation where it *should* work. – Tim Pietzcker Jan 25 '13 at 20:49
  • @Tim - Yes, this does work! At least, it separates the Arabic and Latin (though I must admit that I don't quite understand how it works, what does the re.LOCALE do exactly?). It doesn't seem to separate the punctuation, though. – larapsodia Jan 25 '13 at 21:54
  • @larapsodia: `\b` usually matches at the start/end of ASCII alphanumerics which is fine for English words but fails with accented characters (`élève` would be split into `é`, `l`, `è` and `ve`). `re.LOCALE` takes the current locale into account. – Tim Pietzcker Jan 25 '13 at 21:58
  • @Tim - Ah, I see. Thank you for the explanation. Is there any way to modify this so that it works for the punctuation as well? – larapsodia Jan 25 '13 at 22:23
  • It seems to work with punctuation for me... although it also inserts and extra space at the beginning and end of the string. – Karl Knechtel Jan 25 '13 at 22:58
  • That's strange. This is the output that I get: ` أشياء أخرى من الـ casier (مشط ومرآة وصابونة!..)` Sorry, the formatting is all jacked up , but the important thing is that it separated the `الـ` which was attached to the `casier`, but didn't affect the punctuation. – larapsodia Jan 26 '13 at 00:57
0

Assuming test_lang is not the bottleneck, I'd try:

''.join(
    x + ' '
    if x in punc or y in punc or test_lang(x) != test_lang(y)
    else x
    for x, y in zip(letters[:-1], letters[1:])
)
Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
0

Here is a solution that uses yield. I would be interested to know whether this runs any faster than your original solution.

This avoids all the indexing in the original. It just iterates through the input, holding onto a single previous character.

This should be easy to modify if your requirements change in the future.

ch_sep = ' '

def _sep_chars_by_lang(s_input):
    itr = iter(s_input)
    ch_prev = next(itr)

    yield ch_prev

    while True:
        ch = next(itr)
        if test_lang(ch_prev) != test_lang(ch) or ch_prev in punc:
            yield ch_sep
        yield ch
        ch_prev = ch

def sep_chars_by_lang(s_input):
    return ''.join(_sep_chars_by_lang(s_input))
steveha
  • 74,789
  • 21
  • 92
  • 117
0

Keeping the basic logic of the OP's original code, we speed it up by not doing all that [i] and [i+1] indexing. We use a prev and next reference that scan through the string, maintaining prev one character behind next:

# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ''
prev = letters[0]
for next in letters[1:]:
    cleaned += prev
    if test_lang(prev) != test_lang(next):
        cleaned += ' '
    if prev in punc or next in punc:
        cleaned += ' '
    prev = next
cleaned += next

Testing on a string of 10 million characters shows this is about twice the speed of the OP code. The "string concatenation is slow" complaint is obsolete, as others have pointed out. Running the test again using the ''.join(...) metaphor shows a slighly slower execution than using string concatenation.

Further speedup may come through not calling the test_lang() function but by inlining some simple code. Can't comment as I don't really know what test_lang() does :).

Edit: removed a 'return' statement that should not have been there (testing remnant!).

Edit: Could also speedup by not calling test_lang() twice on the same character (on next in one loop and then prev in the following loop). Cache the test_lang(next) result.

rzzzwilson
  • 141
  • 5