Efficiently generate a list of all possible substrings of a string

Question

I want to write a function that efficiently returns a list of all possible substrings of a string according to a minimum and maximum length of the substrings. (The strings contain uppercase letters only.)

For example, for a String 'THISISASTRING', for min_length=3 and max_length=4, it should return:

['THI', 'THIS', 'HIS', 'HISI', 'ISI', 'ISIS', 'SIS', 'SISA', 'ISA',
 'ISAS', 'SAS', 'SAST', 'AST', 'ASTR', 'STR', 'STRI', 'TRI', 'TRIN',
 'RIN', 'RING', 'ING']

I'm looking for a solution that is much faster than my current one:

import cProfile

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

def assemble_substrings(textstring, length_min, length_max):
    str_len = len(textstring)
    subStringList = []
    idx = 0
    while idx <= str_len - length_min:
        max_depth = min(length_max, str_len - idx)
        for i in list(range(length_min, max_depth + 1)):
            subString = textstring[idx:idx + i]
            subStringList.append(subString)
        idx += 1
    return subStringList


pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = assemble_substrings(textstring=random_english_text, length_min=4, length_max=10)

pr.disable()
pr.print_stats(sort='cumtime')

which yields me:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    1.332    0.001    1.672    0.002 <input>:11(assemble_substrings)
  3654000    0.227    0.000    0.227    0.000 {method 'append' of 'list' objects}
   525000    0.112    0.000    0.112    0.000 {built-in method builtins.min}
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Now, from the output of the profiler, I don't get much insight on how to speed this function up.

What is the best way to make this function as fast as possible? Should I use a different data structure than a list? Use Cython? Or write this code in an external C/C++ shared object?

Would be highly appreciative for inputs, also generally on how to efficiently deal with strings and operations similar to the one above on them in Python.

What do you want to use this for? There may be a better data structure than a list with all the options. — RemcoGerlich, Feb 24 '19 at 17:55
Also I think you have a bug with an edge case, when the loop is at the last length_min characters of the string it still adds the string for each of the allowed lengths. I think the end of the string will need to be treated as a special case. — RemcoGerlich, Feb 24 '19 at 17:57
for all the items in the resulting list (or other data structure), i want to be able to: 1. check for each of them, if they are contained in another list (or other data structure) 2. iterate over all elements and use them as keys to access a dictionary — streamlead, Feb 24 '19 at 17:59
Your problem reminds me part of other problem although I can't remember which exactly. Maybe [k-nucleotide](https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/knucleotide.html#knucleotide)? — Alex Yu, Feb 24 '19 at 18:01
Why are you using `list(range(...))`? Seems like the call to `list()` is unnecessary. — John Gordon, Feb 24 '19 at 18:06

Patrick Artner · Answer 1 · 2019-02-24T18:20:28.687

Why not simply use a list-comprehension over 2 ranges and string slicing?

t = "SOMETEXT"

print(t)

minl = 3
maxl = 8

parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]

print(parts)

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'METEXT', 'METEXT', 'ETE', 'ETEX', 
 'ETEXT', 'ETEXT', 'ETEXT', 'ETEXT', 'TEX', 'TEXT', 'TEXT', 'TEXT', 'TEXT', 'TEXT']

You can use a set to remove dupes if order is not important - else create a unique list for in-order storage:

nodupes = [] 
k = set() 
for l in parts:
    if l in k:
        pass
    else:
        nodupes.append(l)
        k.add(l)

print(nodupes)

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'ETE', 'ETEX', 'ETEXT', 'TEX', 'TEXT']

With timings:

def doit(t,minl,maxl):
    parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]
    return parts

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = doit(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

         3001 function calls in 0.597 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.001    0.000    0.597    0.001 main.py:10(doit)
     1000    0.596    0.001    0.596    0.001 main.py:11(<listcomp>)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Yours gives: 4181001 function calls in 1.614 seconds

Thank you so much for your answer! Needless to say, your answer is way more elegant than mine. :) Regarding speed, your solution is about 3 times faster than mine, which is already good but it would still be great if it could be much, much faster :) — streamlead, Feb 24 '19 at 18:19

score 0 · Answer 2 · answered Feb 24 '19 at 18:59

I am not sure how fast it would actually be (read: need further investigation), but for me it sounds like task for regular expressions (re module), I would do it following way:

import re
minlen = 3
maxlen = 4
s = 'THISISASTRING'
out = []
for i in range(minlen,maxlen+1):
    p = re.compile('(?=(.{'+str(i)+'}))',re.DOTALL)
    out = out+p.findall(s)
print(out)

Output:

['THI', 'HIS', 'ISI', 'SIS', 'ISA', 'SAS', 'AST', 'STR', 'TRI', 'RIN', 'ING', 'THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING']

I used bernie answer from this topic to get findall to work in overlapping fashion. I know that this particular Zero Length Assertion could harness variable length patterns however when I did re.findall('(?=(.{3,4}))','THISISASTRING') it produced ['THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING', 'ING'] which is not desired output. Thus I present mixed for-re solution, with each turn of loop for particular length of string. There I must admit, that I am not good enough at re, to make it work in single-pass manner (solely re, without for), however maybe some other user will be able to accomplish it?

Thx for your answer, just checked and your solution is about 3 times faster than mine. However, I was rather hoping for a more drastic speedup. :) — streamlead, Feb 24 '19 at 19:24

Mykola Zotko · Answer 3 · 2019-02-24T21:10:13.290

You can map ’’.join() to zipped strings:

def func(s, min_l, max_l):
    return [subl for i in range(min_l, max_l + 1)
                 for subl in map(''.join, zip(*[s[i:] for i in range(i)]))]

Profile:

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = func(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

Output:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000    0.002    0.000    0.772    0.001 Untitled.py:3(func)
  7000    0.014    0.000    0.014    0.000 Untitled.py:4(<listcomp>)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Efficiently generate a list of all possible substrings of a string

3 Answers3

Profile:

Output: