I want to write a function that efficiently returns a list of all possible substrings of a string according to a minimum and maximum length of the substrings. (The strings contain uppercase letters only.)
For example, for a String 'THISISASTRING'
, for min_length=3
and max_length=4
, it should return:
['THI', 'THIS', 'HIS', 'HISI', 'ISI', 'ISIS', 'SIS', 'SISA', 'ISA',
'ISAS', 'SAS', 'SAST', 'AST', 'ASTR', 'STR', 'STRI', 'TRI', 'TRIN',
'RIN', 'RING', 'ING']
I'm looking for a solution that is much faster than my current one:
import cProfile
random_english_text = \
'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'
def assemble_substrings(textstring, length_min, length_max):
str_len = len(textstring)
subStringList = []
idx = 0
while idx <= str_len - length_min:
max_depth = min(length_max, str_len - idx)
for i in list(range(length_min, max_depth + 1)):
subString = textstring[idx:idx + i]
subStringList.append(subString)
idx += 1
return subStringList
pr = cProfile.Profile()
pr.enable()
for i in range(0, 1000):
list_of_substrings = assemble_substrings(textstring=random_english_text, length_min=4, length_max=10)
pr.disable()
pr.print_stats(sort='cumtime')
which yields me:
ncalls tottime percall cumtime percall filename:lineno(function)
1000 1.332 0.001 1.672 0.002 <input>:11(assemble_substrings)
3654000 0.227 0.000 0.227 0.000 {method 'append' of 'list' objects}
525000 0.112 0.000 0.112 0.000 {built-in method builtins.min}
1000 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Now, from the output of the profiler, I don't get much insight on how to speed this function up.
What is the best way to make this function as fast as possible? Should I use a different data structure than a list? Use Cython? Or write this code in an external C/C++ shared object?
Would be highly appreciative for inputs, also generally on how to efficiently deal with strings and operations similar to the one above on them in Python.