4

I need to perform stemming on Portuguese strings. To do so, I'm tokenizing the string with the nltk.word_tokenize() function and then stemming each word individually. After that, I rebuild the string. It works, but it doesn't perform well. How can I make it faster? The string is about 2 million words long.

    tokenAux=""
    tokens = nltk.word_tokenize(portugueseString)
        for token in tokens:
            tokenAux = token
            tokenAux = stemmer.stem(token)    
            textAux = textAux + " "+ tokenAux
    print(textAux)

Sorry for my bad English, and thanks!

yuridamata

3 Answers

3

Strings are immutable, so it is not good practice to rebuild the string on every iteration when the string is long. The link here explains various ways to concatenate strings and shows a performance analysis. And since the iteration is done only once, it is better to choose a generator expression over a list comprehension (for details you can look into the discussion here). So in this case, using a generator expression with join is helpful:

Using my_text as the long string: len(my_text) -> 444399

Using timeit to compare:

%%timeit
tokenAux=""
textAux=""
tokens = nltk.word_tokenize(my_text)
for token in tokens:
    tokenAux = token
    tokenAux = stemmer.stem(token)    
    textAux = textAux + " "+ tokenAux

Result:

1 loop, best of 3: 6.23 s per loop

Using generator expression with join:

%%timeit 
' '.join(stemmer.stem(token) for token in nltk.word_tokenize(my_text))

Result:

1 loop, best of 3: 2.93 s per loop
niraj
1

String objects are immutable in Python. Look at your code:

textAux = ""
for token in tokens:
    # something important ...
    textAux = textAux + " "+ tokenAux

Every iteration of the loop creates a new string and assigns it to the textAux variable. This is not efficient.

I would store the stemmed tokens in a list and just join them at the very end. See the example:

textAux = []  # list for storing the stemmed tokens
tokens = nltk.word_tokenize(portugueseString)
for token in tokens:
    tokenAux = stemmer.stem(token)
    textAux.append(tokenAux)  # add the stemmed token to the resulting list

result = " ".join(textAux)  # join the list using a space as separator
print(result)

Compare the performance and share it with us :)
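For a rough comparison outside of IPython's %%timeit, something like this could work (a minimal sketch assuming stemmer and portugueseString from the question; the helper names are illustrative):

    import timeit
    import nltk

    tokens = nltk.word_tokenize(portugueseString)

    def concat_version():
        # Rebuild the string by repeated concatenation (original approach).
        textAux = ""
        for token in tokens:
            textAux = textAux + " " + stemmer.stem(token)
        return textAux

    def join_version():
        # Collect stemmed tokens and join once at the end.
        return " ".join(stemmer.stem(token) for token in tokens)

    print("concat:", timeit.timeit(concat_version, number=1))
    print("join:  ", timeit.timeit(join_version, number=1))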

Viach Kakovskyi
0

You could read the string in from a text file and then stem each word using PySpark. This would allow you to perform the operations in parallel.
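A rough sketch of that idea (the file paths, app name, and the choice of RSLPStemmer are illustrative, and the NLTK 'punkt' and 'rslp' resources must be available on every worker):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("portuguese-stemming").getOrCreate()

    # Each element of the RDD is one line of the input file.
    lines = spark.sparkContext.textFile("portuguese_text.txt")

    def stem_line(line):
        import nltk
        from nltk.stem import RSLPStemmer
        stemmer = RSLPStemmer()  # mapPartitions would avoid re-creating this per line
        return " ".join(stemmer.stem(t) for t in nltk.word_tokenize(line))

    # Stem every line in parallel across the cluster and write the result out.
    lines.map(stem_line).saveAsTextFile("stemmed_output")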

You can also use the multiprocessing module.
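A minimal sketch with multiprocessing.Pool (the worker count and chunk size are arbitrary, and it assumes an NLTK Portuguese stemmer such as RSLPStemmer plus the portugueseString variable from the question):

    import nltk
    from multiprocessing import Pool
    from nltk.stem import RSLPStemmer  # requires the NLTK 'rslp' resource

    stemmer = RSLPStemmer()

    def stem_chunk(tokens):
        # Stem one batch of tokens inside a worker process.
        return [stemmer.stem(token) for token in tokens]

    def parallel_stem(text, workers=4, chunk_size=100000):
        tokens = nltk.word_tokenize(text)
        # Split the token list into large batches, one task per batch.
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        with Pool(workers) as pool:
            stemmed = pool.map(stem_chunk, chunks)
        # Flatten the batches and rebuild the string with a single join.
        return " ".join(token for chunk in stemmed for token in chunk)

    if __name__ == "__main__":
        # portugueseString as in the question
        print(parallel_stem(portugueseString))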