-1

I have two big string lists, and I need to compare each item from the second list with the first item from the first list. Then the second item, third, etc.

I've been using the following code which is fine, but when lists have ~20000 elements, it just takes too much time to process.

Substring is the second array, and list_elem is first

found = [
    elem
    for elem in list_elem
    for substring in substrings
    if re.search(r"\b" + substring + r"\b", elem, re.IGNORECASE)
]

I've been trying to implement an answer from this post, but it doesn't work in this way and I don't really know how to change it to search instead of replace.. And I'm talking about Eric's answer. Here is my code:

found_words = set(word.lower().strip() for word in substrings)
found = []
def find_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in found_words:
        return ""
    else:
        return word

word_pattern = re.compile('\w+')

for sentence in list_elem:
    sentence = word_pattern.sub(find_words, sentence)
    found.append(sentence)

return found

I mean, that there is none pattern.search() regex function.

I want to find substring in elem, and if there is one, I'm adding elem to list.

Yogesh Patil
  • 536
  • 5
  • 20
regorg
  • 43
  • 8
  • 1
    Let me make sure I understand the problem first... you have two large lists of strings and you're trying to find all the elements that are common between the two lists based on whether a substring match exists on any element in either list? – karnesJ.R Dec 12 '19 at 16:08
  • Regexes are slow because they are mostly around complex patterns - and for that, they are good. But if you just need to find a word you know, just use `str` methods. – h4z3 Dec 12 '19 at 16:10
  • What would be the result? Are you looking for intersection of two lists of strings that ignores case (upper/lower) differences? – PM 77-1 Dec 12 '19 at 16:13
  • Well, i want to check if any element of second list is part of first element from first list, then i want to do the same for second element of first list and so on, unless i find all matches. That's why i've been using `re.search(r"\b" + substring + r"\b", elem, re.IGNORECASE)` originally. – regorg Dec 12 '19 at 16:13
  • Alright, so yes, you are checking to find all elements of the second list which are substrings of any element of the first list. If the first list is "n" elements large, and the second list is "m" elements large, you will be running in the worst case O(nm) searches. This is the best you can do. If both of your search arrays are 20,000 elements large then you are looking at doing, at worst, 400,000,000 searches. That's going to take some time. – karnesJ.R Dec 12 '19 at 16:20
  • I get it. But what about the post that i've pasted a link above? It wouldn't work in this situation? – regorg Dec 12 '19 at 16:27
  • 1
    It would work. The issue that you're running into is not the number of searches to be done, it's the type of searches you're performing. If you want to speed up the process towards optimacy then you're going to need to write a custom searcher to utilize the Trie data structure. It's much faster to use that and then parse through the substrings by looking on successive hash depths than it is to use RegEx which is computationally expensive. – karnesJ.R Dec 12 '19 at 16:47
  • Thanks! For now i'll stay with @AlexisBRENON solution, but in future i'm gonna think about it. – regorg Dec 12 '19 at 16:55
  • 1
    Updated my answer with a third solution which can better fit your needs. @karnesJ.R, I don't know Trie data structure, have you got any good link? – AlexisBRENON Dec 12 '19 at 17:27

1 Answers1

4

Regular expression are costly... In your case why not rely on something like:

found = [
    elem
    for elem in list_elem
    for substring in substrings
    if substring.lower() in elem.lower()
]

If you really want to use regular expression I advise you to compile your regex beforehand:

regexes = [
    re.compile(r"\b" + substring + r"\b", re.IGNORECASE)
    for substring in substrings
]
found = [
    elem
    for elem in list_elem
    for regex in regexes
    if regex.search(elem)
]

This should speed up a little bit your execution, but not sure that it is enough to handle a lot of elements.

Finally, if you want to work with sets like in your second code snippet, you can split your sentences and check substring membership.

Here is a code snippet that compare three solutions:

import re
import timeit


def main():
    sentences = [
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
        "Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.",
        "Donec eget est sit amet eros pharetra fermentum.",
        "Proin volutpat dui eget lacinia lobortis.",
        "Pellentesque a ante eu ligula dictum lobortis.",
        "Phasellus et velit mattis, finibus velit ut, luctus purus.",
        "Integer et risus maximus, varius eros vel, venenatis libero.",
        "Proin blandit dui vel ante blandit, et aliquet orci placerat.",
        "Suspendisse eu lacus non nunc accumsan congue non id velit.",
        "Pellentesque condimentum dui ultricies justo dapibus imperdiet eget eget tortor.",
        "Suspendisse faucibus erat a dolor fringilla suscipit.",
        "Duis quis tortor convallis purus porttitor elementum.",
        "Proin rutrum lacus sed elit convallis porttitor.",
        "Nunc sit amet felis ullamcorper, pulvinar augue sed, ullamcorper leo.",
        "Ut a erat eu lectus semper pulvinar.",
        "Praesent congue velit at nulla finibus hendrerit.",
        "Nullam ac erat convallis, suscipit purus quis, dictum eros.",
        "Proin condimentum orci nec suscipit imperdiet.",
        "Aliquam mattis mauris et turpis sodales tincidunt.",
        "Vestibulum vel ante bibendum, interdum ipsum efficitur, porta massa.",
        "Duis finibus nisl et urna pharetra, non dignissim ante tempus.",
        "Morbi nec lorem eu odio interdum feugiat id sed ligula.",
        "Donec nec lorem vitae est consequat dictum.",
        "Quisque vehicula quam eu pretium viverra.",
        "Etiam luctus metus faucibus, vulputate augue et, luctus enim.",
        "Nullam quis elit vitae enim bibendum viverra.",
        "Donec consectetur nulla sit amet nulla blandit, sed auctor mi porttitor.",
        "Suspendisse vel urna vitae massa mollis dignissim quis eu nulla.",
        "Vestibulum id risus fringilla nisl consectetur fringilla.",
        "Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.",
        "Sed feugiat orci pulvinar, imperdiet massa sit amet, semper dui.",
        "Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.",
        "Cras vel urna fermentum, tempus magna ut, aliquet dolor.",
        "Fusce non urna molestie massa sagittis accumsan et eget lorem.",
        "Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.",
    ]
    substrings = [
        "a", "eu", "non", "ac", "at", "et"
    ]

    print("if_in:")
    print(if_in(sentences, substrings))
    print(timeit.timeit(lambda : if_in(sentences, substrings), number=100))
    print("re_compile:")
    print(re_compile(sentences, substrings))
    print(timeit.timeit(lambda : re_compile(sentences, substrings), number=100))
    print("in_set:")
    print(in_set(sentences, substrings))
    print(timeit.timeit(lambda : in_set(sentences, substrings), number=100))



def if_in(sentences, substrings):
    return [
        s
        for s in sentences
        for substring in substrings
        if substring.lower() in s.lower()
    ]

def re_compile(sentences, substrings):
    regexes = [
        re.compile(r"\b" + substring + r"\b", re.IGNORECASE)
        for substring in substrings
    ]
    return [
        s
        for s in sentences
        for regex in regexes
        if regex.search(s)
    ]

def in_set(sentences, substrings):
    # Split your sentences on word boundaries (see https://stackoverflow.com/questions/37237594/how-can-i-split-at-word-boundaries-with-regexes)
    splitter = re.compile(r'\w+|\W+')
    sentence_sets = [set(splitter.findall(s)) for s in sentences]

    return [
        s
        for s, s_set in zip(sentences, sentence_sets)
        for sub in substrings
        if sub in s_set
    ]

if __name__ == '__main__':
    main()
if_in:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.', 'Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.', 'Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.', 'Donec eget est sit amet eros pharetra fermentum.', 'Donec eget est sit amet eros pharetra fermentum.', 'Proin volutpat dui eget lacinia lobortis.', 'Proin volutpat dui eget lacinia lobortis.', 'Proin volutpat dui eget lacinia lobortis.', 'Proin volutpat dui eget lacinia lobortis.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Phasellus et velit mattis, finibus velit ut, luctus purus.', 'Phasellus et velit mattis, finibus velit ut, luctus purus.', 'Phasellus et velit mattis, finibus velit ut, luctus purus.', 'Integer et risus maximus, varius eros vel, venenatis libero.', 'Integer et risus maximus, varius eros vel, venenatis libero.', 'Integer et risus maximus, varius eros vel, venenatis libero.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Pellentesque condimentum dui ultricies justo dapibus imperdiet eget eget tortor.', 'Pellentesque condimentum dui ultricies justo dapibus imperdiet eget eget tortor.', 'Suspendisse faucibus erat a dolor fringilla suscipit.', 'Suspendisse faucibus erat a dolor fringilla suscipit.', 'Duis quis tortor convallis purus porttitor elementum.', 'Proin rutrum lacus sed elit convallis porttitor.', 'Proin rutrum lacus sed elit convallis porttitor.', 'Nunc sit amet felis ullamcorper, pulvinar augue sed, ullamcorper leo.', 'Nunc sit amet felis ullamcorper, pulvinar augue sed, ullamcorper leo.', 'Ut a erat eu lectus semper pulvinar.', 'Ut a erat eu lectus semper pulvinar.', 'Ut a erat eu lectus semper pulvinar.', 'Praesent congue velit at nulla finibus hendrerit.', 'Praesent congue velit at nulla finibus hendrerit.', 'Nullam ac erat convallis, suscipit purus quis, dictum eros.', 'Nullam ac erat convallis, suscipit purus quis, dictum eros.', 'Nullam ac erat convallis, suscipit purus quis, dictum eros.', 'Proin condimentum orci nec suscipit imperdiet.', 'Aliquam mattis mauris et turpis sodales tincidunt.', 'Aliquam mattis mauris et turpis sodales tincidunt.', 'Aliquam mattis mauris et turpis sodales tincidunt.', 'Vestibulum vel ante bibendum, interdum ipsum efficitur, porta massa.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Morbi nec lorem eu odio interdum feugiat id sed ligula.', 'Morbi nec lorem eu odio interdum feugiat id sed ligula.', 'Morbi nec lorem eu odio interdum feugiat id sed ligula.', 'Donec nec lorem vitae est consequat dictum.', 'Donec nec lorem vitae est consequat dictum.', 'Quisque vehicula quam eu pretium viverra.', 'Quisque vehicula quam eu pretium viverra.', 'Quisque vehicula quam eu pretium viverra.', 'Etiam luctus metus faucibus, vulputate augue et, luctus enim.', 'Etiam luctus metus faucibus, vulputate augue et, luctus enim.', 'Etiam luctus metus faucibus, vulputate augue et, luctus enim.', 'Nullam quis elit vitae enim bibendum viverra.', 'Donec consectetur nulla sit amet nulla blandit, sed auctor mi porttitor.', 'Donec consectetur nulla sit amet nulla blandit, sed auctor mi porttitor.', 'Suspendisse vel urna vitae massa mollis dignissim quis eu nulla.', 'Suspendisse vel urna vitae massa mollis dignissim quis eu nulla.', 'Vestibulum id risus fringilla nisl consectetur fringilla.', 'Vestibulum id risus fringilla nisl consectetur fringilla.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Sed feugiat orci pulvinar, imperdiet massa sit amet, semper dui.', 'Sed feugiat orci pulvinar, imperdiet massa sit amet, semper dui.', 'Sed feugiat orci pulvinar, imperdiet massa sit amet, semper dui.', 'Sed feugiat orci pulvinar, imperdiet massa sit amet, semper dui.', 'Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.', 'Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.', 'Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.', 'Cras vel urna fermentum, tempus magna ut, aliquet dolor.', 'Cras vel urna fermentum, tempus magna ut, aliquet dolor.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.']
0.029533967001043493
re_compile:
['Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Phasellus et velit mattis, finibus velit ut, luctus purus.', 'Integer et risus maximus, varius eros vel, venenatis libero.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse faucibus erat a dolor fringilla suscipit.', 'Ut a erat eu lectus semper pulvinar.', 'Ut a erat eu lectus semper pulvinar.', 'Praesent congue velit at nulla finibus hendrerit.', 'Nullam ac erat convallis, suscipit purus quis, dictum eros.', 'Aliquam mattis mauris et turpis sodales tincidunt.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Morbi nec lorem eu odio interdum feugiat id sed ligula.', 'Quisque vehicula quam eu pretium viverra.', 'Etiam luctus metus faucibus, vulputate augue et, luctus enim.', 'Suspendisse vel urna vitae massa mollis dignissim quis eu nulla.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.']
0.13305247199969017
in_set:
['Morbi congue lacus vitae dolor tempus, eu auctor elit interdum.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Pellentesque a ante eu ligula dictum lobortis.', 'Phasellus et velit mattis, finibus velit ut, luctus purus.', 'Integer et risus maximus, varius eros vel, venenatis libero.', 'Proin blandit dui vel ante blandit, et aliquet orci placerat.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse eu lacus non nunc accumsan congue non id velit.', 'Suspendisse faucibus erat a dolor fringilla suscipit.', 'Ut a erat eu lectus semper pulvinar.', 'Ut a erat eu lectus semper pulvinar.', 'Praesent congue velit at nulla finibus hendrerit.', 'Nullam ac erat convallis, suscipit purus quis, dictum eros.', 'Aliquam mattis mauris et turpis sodales tincidunt.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Duis finibus nisl et urna pharetra, non dignissim ante tempus.', 'Morbi nec lorem eu odio interdum feugiat id sed ligula.', 'Quisque vehicula quam eu pretium viverra.', 'Etiam luctus metus faucibus, vulputate augue et, luctus enim.', 'Suspendisse vel urna vitae massa mollis dignissim quis eu nulla.', 'Etiam ut risus ultrices, consequat sapien eu, rhoncus justo.', 'Donec nec metus quis nulla sagittis ullamcorper vitae eu ante.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Fusce non urna molestie massa sagittis accumsan et eget lorem.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.', 'Curabitur ac nisl at mauris pellentesque tincidunt nec sed magna.']
0.11130324300029315

As you can see, the first solution is very fast but returns too many result because it misses the "word boundary" constraint on substring. Other solutions are quiet equivalent (multiple runs give sometimes one faster than the other and sometimes the inverse). re_compile should be faster if you have few substrings to test, else in_set should be faster if you have few sentences.

I don't know about the Trie data structure proposed by karnesJ.R, but it can be a good idea to re-think your algorithm if you really have a lot of elemets.

AlexisBRENON
  • 2,921
  • 2
  • 18
  • 30
  • This may not be the problem the OP is trying to solve. – karnesJ.R Dec 12 '19 at 16:10
  • @karnesJ.R Why? This is roughly equivalent to its first working solution. It just not handle the word boundaries, that can ever be handled with preprocessing or making the condition a little bit complex. – AlexisBRENON Dec 12 '19 at 16:15
  • Word Boundary concerns was the point of the comment. Your suggestion to pre-process is excellent. – karnesJ.R Dec 12 '19 at 16:23
  • Well, I've been trying to compile it beforehand, but it still takes a lot of time. But thanks for your answer, it's always something – regorg Dec 12 '19 at 16:25
  • @alexis well, apparently i've been using it in wrong way.. :) It does work fine, i don't have to wait 30 minutes but ~20 seconds. Thank you! – regorg Dec 12 '19 at 16:57