How to make this function faster?

Question

I'm new to Python and would like to make this function faster.

The function takes a string as a parameter and returns a list of SEs (sound elements).

A 'sound element' (SE) is a maximal sequence of 1 or more consonants followed by 1 or more vowels:

first all the consonants
then all the vowels (aeioujy)
all non-alphabetic chars like spaces, numbers, colon, comma etc. must be ignored
all accents from accented letters (e.g. è->e) must be removed
differences between uppercase and lowercase letters are disregarded

NOTE: the only exceptions are the first and the last SE of a verse, which may contain only vowels and only consonants, respectively.

Example:

If the verse is "Donàld Duck! wènt, to tHe seas'ìde to swim"

the SEs are [ 'do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm' ]

def es_converter(string):
    
    
    vowels, li_es, container = ['a', 'e', 'i', 'o', 'u', 'y', 'j'], [] , ''

    #check for every element in the string
    for i in range(len(string)):
        #i is a vowel?
        if not string[i] in vowels:
            # No, then add it in the variable container
            container += string[i]
            # is the last element of the list?
            if i == (len(string) - 1):
                #yes, then insert inside the list li_es, container and then set it back to ''
                li_es.append(container)
                container = ''
        # string[i] is a vowel: is the next character also a vowel?
        elif i < (len(string)-1) and string[i+1] in vowels:
            #yes, add in container
            container += string[i]
        else:
            #no, add in container, append container on the list li_es, set container to '' 
            container += string[i]
            li_es.append(container)
            container = ''
    return li_es

Thanks for all the suggestions! (Unfortunately I can't use any imports.)

"make this function faster" - Is it critically slow now? Did you do some timing on sample data? How much time did it take? What is the size of your data? What would you expect? — Thierry Lathuille, Nov 15 '20 at 09:14
Hi! Thanks for your answer! Unfortunately this is part of a larger piece of code. After running it, the time for this function is 0.599 and I need to decrease it to at least 0.100, and I don't know how. — Dan2783, Nov 15 '20 at 09:20
Please provide the expected [MRE](https://stackoverflow.com/help/minimal-reproducible-example). Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into file, run it, and reproduce your problem. This also lets us test any suggestions in your context. Your posted code simply defines a function and quits: it does not print timing data for a test set. — Prune, Nov 15 '20 at 09:42
It will help us a lot if you use meaningful variable names. Also outline your algorithm. Where is your code spending most of its time? Use a profiler. — Prune, Nov 15 '20 at 09:46

Thierry Lathuille · Accepted Answer · 2020-11-15T10:08:30.083

A big source of inefficiency in your current code is that you use indices throughout when iterating over your string. Rather than:

for i in range(len(data)):
    x = data[i]
    ...
    if data[i] == ...

you should always do:

for char in data:
    x = char
    ...
    if char == ...

and if you really need indices at some point, use enumerate:

for i, char in enumerate(data):
    ...

and only use the indices when really needed.
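
As a rough sketch of that advice applied to the function in the question, the grouping can be done in a single pass over the characters with no indices and no imports. The names `se_no_imports`, `ACCENTS` and `VOWELS` are made up for this illustration, and the accent table is an assumption that only covers a handful of letters, so it would need extending for real data:

```python
# Hypothetical no-import sketch. ACCENTS is a small illustrative table,
# not a complete list of accented letters.
ACCENTS = str.maketrans("àáâäèéêëìíîïòóôöùúûü", "aaaaeeeeiiiioooouuuu")
VOWELS = frozenset("aeioujy")


def se_no_imports(verse):
    # lowercase, map the known accented letters, keep only a-z
    text = verse.lower().translate(ACCENTS)
    letters = [char for char in text if "a" <= char <= "z"]

    elements = []
    current = []
    prev_is_vowel = False
    for char in letters:
        is_vowel = char in VOWELS
        # a consonant right after a vowel starts a new sound element
        if prev_is_vowel and not is_vowel:
            elements.append("".join(current))
            current = []
        current.append(char)
        prev_is_vowel = is_vowel
    # flush the last element (it may consist of consonants only)
    if current:
        elements.append("".join(current))
    return elements


print(se_no_imports("Donàld Duck! wènt, to tHe seas'ìde to swim"))
# ['do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
```

Collecting characters in a list and joining once per element avoids repeated string concatenation, and the set lookup for vowels is constant time. Whether this is fast enough for your data is something only a measurement can tell.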

I would rather use a regex here, though. Without sample data, I can't time it, but I'm certain that it would be much faster than using Python loops.

The process is:

remove all non-alphabetic characters
make the string lowercase
remove the accents, which your current code doesn't do
split the string using a regex that describes your conditions.

So, you could do:

import re
import unicodedata

# from https://stackoverflow.com/a/44433664/550094
def strip_accents(text):
    return unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode('ascii')


def se(data):
    # keep only alphabetical characters
    # (\W alone would let digits and underscores through)
    data = re.sub(r'[\W\d_]', '', data)
    # make lowercase
    data = data.casefold()
    # strip accents from the remaining data
    data = strip_accents(data)

    # creating the regex: 
    #  - start of the string followed by vowels, or
    #  - consonants followed by vowels, or
    #  - consonants followed by end of string
    vowels = 'aeiouyj'  # the question counts j as a vowel too
    se_regex = re.compile(rf'^[{vowels}]+|[^{vowels}]+[{vowels}]+|[^{vowels}]+$')
    
    # return the SEs
    return se_regex.findall(data)

Sample run (I added a vowel at the start of your string to test this case):

data = "A Donàld Duck! wènt, to tHe seas'ìde to swim"
print(se(data))
# ['a', 'do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
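
Since no timing data was posted, here is one possible way to measure the regex split with the standard `timeit` module. This is only a sketch: `split_se` and the repeated sample string are assumptions made for the illustration, and the input is taken to be already lowercased and stripped of accents and non-letters:

```python
import re
import timeit

VOWELS = "aeiouyj"
SE_REGEX = re.compile(rf"^[{VOWELS}]+|[^{VOWELS}]+[{VOWELS}]+|[^{VOWELS}]+$")


def split_se(clean_text):
    # expects text that is already lowercase ASCII letters only
    return SE_REGEX.findall(clean_text)


# Hypothetical sample: the normalized example verse, repeated
sample = "donaldduckwenttotheseasidetoswim" * 1000

runs = 100
elapsed = timeit.timeit(lambda: split_se(sample), number=runs)
print(f"{elapsed / runs:.6f} s per call")
```

Running the same measurement on the original loop with your real input would show whether the change actually gets you under your target time.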

@MadPhysicist Thanks, that might be a good idea, the current version with lowercase deletes the ß... How can this not be considered a normal letter! — Thierry Lathuille, Nov 15 '20 at 10:08
Thanks a lot for your comment! I really appreciate the time and the suggestions. I will improve my code. Unfortunately, I know that using imports would simplify the work, but I can't use them. I would like to thank everyone who took the time to answer; as I'm new to coding, I really need to improve. Thanks for your patience! — Dan2783, Nov 15 '20 at 10:47
