2

I want to write a function that efficiently returns a list of all possible substrings of a string according to a minimum and maximum length of the substrings. (The strings contain uppercase letters only.)

For example, for a String 'THISISASTRING', for min_length=3 and max_length=4, it should return:

['THI', 'THIS', 'HIS', 'HISI', 'ISI', 'ISIS', 'SIS', 'SISA', 'ISA',
 'ISAS', 'SAS', 'SAST', 'AST', 'ASTR', 'STR', 'STRI', 'TRI', 'TRIN',
 'RIN', 'RING', 'ING']

I'm looking for a solution that is much faster than my current one:

import cProfile

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

def assemble_substrings(textstring, length_min, length_max):
    str_len = len(textstring)
    subStringList = []
    idx = 0
    while idx <= str_len - length_min:
        max_depth = min(length_max, str_len - idx)
        for i in list(range(length_min, max_depth + 1)):
            subString = textstring[idx:idx + i]
            subStringList.append(subString)
        idx += 1
    return subStringList


pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = assemble_substrings(textstring=random_english_text, length_min=4, length_max=10)

pr.disable()
pr.print_stats(sort='cumtime')

which yields me:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    1.332    0.001    1.672    0.002 <input>:11(assemble_substrings)
  3654000    0.227    0.000    0.227    0.000 {method 'append' of 'list' objects}
   525000    0.112    0.000    0.112    0.000 {built-in method builtins.min}
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Now, from the output of the profiler, I don't get much insight on how to speed this function up.

What is the best way to make this function as fast as possible? Should I use a different data structure than a list? Use Cython? Or write this code in an external C/C++ shared object?

Would be highly appreciative for inputs, also generally on how to efficiently deal with strings and operations similar to the one above on them in Python.

martineau
  • 119,623
  • 25
  • 170
  • 301
streamlead
  • 21
  • 1
  • 2
  • What do you want to use this for? There may be a better data structure than a list with all the options. – RemcoGerlich Feb 24 '19 at 17:55
  • Also I think you have a bug with an edge case, when the loop is at the last length_min characters of the string it still adds the string for each of the allowed lengths. I think the end of the string will need to be treated as a special case. – RemcoGerlich Feb 24 '19 at 17:57
  • for all the items in the resulting list (or other data structure), i want to be able to: 1. check for each of them, if they are contained in another list (or other data structure) 2. iterate over all elements and use them as keys to access a dictionary – streamlead Feb 24 '19 at 17:59
  • Your problem reminds me part of other problem although I can't remember which exactly. Maybe [k-nucleotide](https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/knucleotide.html#knucleotide)? – Alex Yu Feb 24 '19 at 18:01
  • 1
    Why are you using `list(range(...))`? Seems like the call to `list()` is unnecessary. – John Gordon Feb 24 '19 at 18:06

3 Answers3

6

Why not simply use a list-comprehension over 2 ranges and string slicing?

t = "SOMETEXT"

print(t)

minl = 3
maxl = 8

parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]

print(parts)

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'METEXT', 'METEXT', 'ETE', 'ETEX', 
 'ETEXT', 'ETEXT', 'ETEXT', 'ETEXT', 'TEX', 'TEXT', 'TEXT', 'TEXT', 'TEXT', 'TEXT']

You can use a set to remove dupes if order is not important - else create a unique list for in-order storage:

nodupes = [] 
k = set() 
for l in parts:
    if l in k:
        pass
    else:
        nodupes.append(l)
        k.add(l)

print(nodupes)   

Output:

['SOM', 'SOME', 'SOMET', 'SOMETE', 'SOMETEX', 'SOMETEXT', 'OME', 'OMET', 'OMETE', 'OMETEX', 
 'OMETEXT', 'MET', 'METE', 'METEX', 'METEXT', 'ETE', 'ETEX', 'ETEXT', 'TEX', 'TEXT']

With timings:

def doit(t,minl,maxl):
    parts = [t[i:i+j] for i in range(len(t)-minl) for j in range(minl,maxl+1)]
    return parts

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = doit(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

         3001 function calls in 0.597 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.001    0.000    0.597    0.001 main.py:10(doit)
     1000    0.596    0.001    0.596    0.001 main.py:11(<listcomp>)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Yours gives: 4181001 function calls in 1.614 seconds

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
  • 1
    Thank you so much for your answer! Needless to say, your answer is way more elegant than mine. :) Regarding speed, your solution is about 3 times faster than mine, which is already good but it would still be great if it could be much, much faster :) – streamlead Feb 24 '19 at 18:19
0

I am not sure how fast it would actually be (read: need further investigation), but for me it sounds like task for regular expressions (re module), I would do it following way:

import re
minlen = 3
maxlen = 4
s = 'THISISASTRING'
out = []
for i in range(minlen,maxlen+1):
    p = re.compile('(?=(.{'+str(i)+'}))',re.DOTALL)
    out = out+p.findall(s)
print(out)

Output:

['THI', 'HIS', 'ISI', 'SIS', 'ISA', 'SAS', 'AST', 'STR', 'TRI', 'RIN', 'ING', 'THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING']

I used bernie answer from this topic to get findall to work in overlapping fashion. I know that this particular Zero Length Assertion could harness variable length patterns however when I did re.findall('(?=(.{3,4}))','THISISASTRING') it produced ['THIS', 'HISI', 'ISIS', 'SISA', 'ISAS', 'SAST', 'ASTR', 'STRI', 'TRIN', 'RING', 'ING'] which is not desired output. Thus I present mixed for-re solution, with each turn of loop for particular length of string. There I must admit, that I am not good enough at re, to make it work in single-pass manner (solely re, without for), however maybe some other user will be able to accomplish it?

Daweo
  • 31,313
  • 3
  • 12
  • 25
  • Thx for your answer, just checked and your solution is about 3 times faster than mine. However, I was rather hoping for a more drastic speedup. :) – streamlead Feb 24 '19 at 19:24
0

You can map ’’.join() to zipped strings:

def func(s, min_l, max_l):
    return [subl for i in range(min_l, max_l + 1)
                 for subl in map(''.join, zip(*[s[i:] for i in range(i)]))]

Profile:

random_english_text = \
    'AHOUSEISABUILDINGTHATISMADEFORPEOPLETOLIVEINITISAPERMANENTBUILDINGTHATISMEANTTOSTAYSTANDINGITISNOTEASILYPACKEDU' \
    'PANDCARRIEDAWAYLIKEATENTORMOVEDLIKEACARAVANIFPEOPLELIVEINTHESAMEHOUSEFORMORETHANASHORTSTAYTHENTHEYCALLITTHEIRHO' \
    'MEBEINGWITHOUTAHOMEISCALLEDHOMELESSNESSHOUSESCOMEINMANYDIFFERENTSHAPESANDSIZESTHEYMAYBEASSMALLASJUSTONEROOMORTH' \
    'EYMAYHAVEHUNDREDSOFROOMSTHEYALSOAREMADEMANYDIFFERENTSHAPESANDMAYHAVEJUSTONELEVELORSEVERALDIFFERENTLEVELSAHOUSEI' \
    'SSOMETIMESJOINEDTOOTHERHOUSESATTHESIDESTOMAKEATERRACEORROWHOUSEACONNECTEDROWOFHOUSES'

pr = cProfile.Profile()
pr.enable()

for i in range(0, 1000):
    list_of_substrings = func(random_english_text, 4, 10)

pr.disable()
pr.print_stats(sort='cumtime')

Output:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000    0.002    0.000    0.772    0.001 Untitled.py:3(func)
  7000    0.014    0.000    0.014    0.000 Untitled.py:4(<listcomp>)
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73