0

I have 2 lists: one of strings and one of substrings

list_strings = ['applepearnutapplepear', 'pearappleapplepearnut', 'nutapplepearnutnut']

list_ss = ['ple', 'ear']

I want to search for every substring in list of strings and if found I want to save its position as index in a dictionary like this:

{
    substring1: [index, index etc.],
    substring2: [index, index etc.]
}

dict_found = {'ple': [2, 14]  #index position for every found item, and so on for every substring}
JonSG
  • 10,542
  • 2
  • 25
  • 36
mod13
  • 25
  • 6

5 Answers5

1

This is just an extra layer layer around [SO]: How to find all occurrences of a substring?.
Any answer there, can (easily) be wrapped to solve the current problem using a dictionary comprehension nesting a list comprehension (both described in [Python.Docs]: Built-in Types).

There are multiple possible answers (as stated) each with its pros and cons (considering various criteria like: speed, amount code, code aspect, ...).
Based on the example from [SO]: How can I invert (swap) the case of each letter in a string? (@CristiFati's answer), I took 2 of them and tried a comparison.

code00.py:

#!/usr/bin/env python

import re
import sys
import timeit
from pprint import pprint as pp


STRS = ("applepearnutapplepear", "pearappleapplepearnut", "nutapplepearnutnut", "dummy")
SUBS = ("ple", "ear", "dummy")


def find_substrings_in_strings_re(ss, subs):
    return {sub: [[m.start(0) for m in re.finditer(sub, s)] for s in ss] for sub in subs}


def indexes(s, sub):
    ret = []
    i = -1
    while True:
        i = s.find(sub, 0 if i == -1 else i + len(sub))
        if i == -1:
            return ret
        ret.append(i)


def find_substrings_in_strings_s0(ss, subs):
    return {sub: [indexes(s, sub) for s in ss] for sub in subs}


FUNCS = (
    find_substrings_in_strings_re,
    find_substrings_in_strings_s0,
)


def test_acc():
    print("Test accuracy:")
    for func in FUNCS:
        print("{:s}:".format(func.__name__))
        pp(func(STRS, SUBS), sort_dicts=False)


def test_perf(count=1000000):
    print("\nTest performance:")
    for func in FUNCS:
        print("{:s}: {:.3f}".format(func.__name__, timeit.timeit(lambda: func(STRS, SUBS), number=count)))


def main(*argv):
    test_acc()
    test_perf()


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.\n")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackExchange\StackOverflow\q076094866]> "e:\Work\Dev\VEnvs\py_pc064_03.10_test0\Scripts\python.exe" ./code00.py
Python 3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)] 064bit on win32

Test accuracy:
find_substrings_in_strings_re:
{'ple': [[2, 14], [6, 11], [5], []],
 'ear': [[6, 18], [1, 15], [9], []],
 'dummy': [[], [], [], [0]]}
find_substrings_in_strings_s0:
{'ple': [[2, 14], [6, 11], [5], []],
 'ear': [[6, 18], [1, 15], [9], []],
 'dummy': [[], [], [], [0]]}

Test performance:
find_substrings_in_strings_re: 18.569
find_substrings_in_strings_s0: 8.331

Done.
CristiFati
  • 38,250
  • 9
  • 50
  • 87
0

Assuming time complexity isnt an issue, this can be solved by:

def getSubstringIndices(strings, substrings):
    
    dict_found = {}

    for ss in substrings:
        dict_found[ss] = []
        for s in strings:
            dict_found[ss].append(s.index(ss))

    return dict_found

list_strings = ['applepearnutapplepear', 'pearappleapplepearnut', 'nutapplepearnutnut']
list_ss = ['ple', 'ear']

print(getSubstringIndices(list_strings, list_ss))
0

IIUC, you can try a combo finditer/start inside a dictcomp :

dict_found = {ss: [[m.start(0) for m in re.finditer(ss, string)]
                   for string in list_strings] for ss in list_ss}

Output :

print(dict_found)

#{'ple': [[2, 14], [6, 11], [5]], 'ear': [[6, 18], [1, 15], [9]]}
Timeless
  • 22,580
  • 4
  • 12
  • 30
  • It works fine. Additionally I would have liked, if it is possible to, to save the indices for every substring in a separate list. For example: {'ple': [2, 14], [6, 11], [5],...} or something like this. This will be useful to me to parse the dictionary and select a sublist from values associated with a key, and go to the indices for further processing. – mod13 Apr 24 '23 at 19:33
  • Got you, I updated my answer ;) – Timeless Apr 24 '23 at 19:37
0

If you don't mind stacked comprehensions:

import re

result = {
    ss: [match.start() for string in list_strings for match in re.finditer(ss, string)]
    for ss in list_ss
}

Uses the builtin regex module. This will give you: {'ple': [2, 14, 6, 11, 5], 'ear': [6, 18, 1, 15, 9]} with your inputs.

Will add an empty list for entries with no matches i.e. {'no matches': []}

DanNunan
  • 1
  • 2
0
list_strings = ['applepearnutapplepear', 'pearappleapplepearnut', 'nutapplepearnutnut']
list_ss = ['ple', 'ear']

dict_found = {}
for sub_string in list_ss:
    dict_found[sub_string] = []
    for string in list_strings:
        last_index = 0
        while True:
            try:
                index = string.index(sub_string, last_index)
            except ValueError:
                break
            dict_found[sub_string].append(index)
            last_index = index + 1
print(dict_found)

Output:

{'ple': [2, 14, 6, 11, 5], 'ear': [6, 18, 1, 15, 9]}
phibel
  • 391
  • 1
  • 8