How to do an exact match in a paragraph using a list of strings in Python

Question

I have a list of strings with some version numbers. I would like to find these strings (as exact matches) in a paragraph. Example: products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

paragraph = "Troubleshooting steps for productA v4.1.5 documents"

In this case, if I use a filter like the following:

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False
results = filter(checkIfProdExist, products)
print(list(results))

The output of the above code is ['productA v4.1', 'productA v4.1.5']

How can I match only 'productA v4.1.5' in the paragraph and get its index?

If you don't want to find `productA v4.1`, don't include it in your list of things to find. – Scott Hunter, Dec 10 '20 at 16:40
Both of those exact substrings _are_ in the document, so the output is correct. Are you saying that you would like to see only the _longest_ match? – John Gordon, Dec 10 '20 at 16:40
My string can contain any of the products from the products list. For example, paragraphB contains both "productA v4.1" and "productA v4.1.5": paragraphB = "Troubleshooting steps for productA v4.1 documents and productA v4.1.5". In this case my output should return both values. When there is only "productA v4.1.5" in the paragraph, the output should not contain "productA v4.1". – user3164444, Dec 10 '20 at 18:03

score 1 · Answer 1 · answered Dec 10 '20 at 16:48

You want to find the longest match, so you should start matching using the longest string first:

products = ["productA v4.1", "productA v4.1.5", "product A v4.1.5 ver"]
productsSorted = sorted(products, key=len, reverse=True)
paragraph = "Troubleshooting steps for productA v4.1.5 documents"


def checkIfProdExist(x):
    if paragraph.find(x) != -1:
        return True
    else:
        return False


def checkIfProdExistAndExit(prods):
    # stop immediately after the first match!
    for x in prods:
        if paragraph.find(x) != -1:
            return x


results = filter(checkIfProdExist, productsSorted)
print(list(results)[0])
results = checkIfProdExistAndExit(productsSorted)
print(results)

Out:

productA v4.1.5
productA v4.1.5

Maurice Meyer

This was my first thought also. But if the search strings are unrelated such as "car" and "airplane", this code will stop after finding "airplane" when you would really want to also find "car". – John Gordon Dec 10 '20 at 17:19
Thanks for the quick reply. The main issue is that the product list will have n values from different products, and the paragraph can refer to any (or many) of the products in the list. That is, the paragraph can contain only "productA v4.1.5", or both "productA v4.1.5" and "productA v4.1", or only "productA v4.1". So if the paragraph contains "productA v4.1", the output should be "productA v4.1". If the paragraph contains "product A v4.1.5 ver", the output should not include productA v4.1.5 or productA v4.1. – user3164444 Dec 10 '20 at 18:12
I just realised I added an extra space between "product" and "A" in one of the list values. Corrected to "productA v4.1.5 ver". My bad. – user3164444 Dec 11 '20 at 20:36

ShadowRanger · Answer 2 · 2020-12-10T18:39:48.793

Sounds like you basically want the beginning and end of the match to be either the end of the paragraph, or a transition to a space character (the end of a "word", though sadly, the regex definition of word excludes stuff like ., so you can't use tests based on \b).

The simplest approach here is to just split the line by whitespace, and see if the string you have occurs in the resulting list (using some variation on finding a sublist in a list):

def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # Extracted up front for efficiency
    for i, x in enumerate(haystack, 1):
        if x == firstn and haystack[i:i+len(restn)] == restn:
            return True
    return False

para_words = paragraph.split()
def checkIfProdExist(x):
    return list_contains_sublist(para_words, x.split())
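
As a rough usage sketch of the word-based approach above, using the paragraph and the (corrected) product list from the question. The `matches` name is only for illustration:

```python
def list_contains_sublist(haystack, needle):
    firstn, *restn = needle  # first word, then the remaining words
    for i, x in enumerate(haystack, 1):
        # i is the position just after x, so the slice covers the words following it
        if x == firstn and haystack[i:i + len(restn)] == restn:
            return True
    return False

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 documents"
para_words = paragraph.split()

# Keep only the products whose words appear as a contiguous run of whole words
matches = [p for p in products if list_contains_sublist(para_words, p.split())]
print(matches)  # ['productA v4.1.5']
```

Because the comparison is made word by word, "v4.1" no longer matches inside "v4.1.5".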

If you want the index too, or need precise whitespace matching, it's trickier (.split() won't preserve runs of whitespace so you can't reconstruct the index, and you might get the wrong index if you index the whole string and the substring occurs twice, but only the second one meets your requirements). At that point, I'd probably just go with a regex:

import re

def checkIfProdExist(x):
    m = re.search(fr'(^|\s){re.escape(x)}(?=\s|$)', paragraph)
    if m:
        return m.end(1)  # After the matched space, if any
    return -1  # Or omit return for implicit None, or raise an exception, or whatever

Note that as written, this won't work with your filter: if the paragraph begins with the substring, it returns 0, which is falsy, and on failure it returns -1, which is truthy. You might have it return None on failure and a tuple of the indices on success so it works both for boolean and index-demanding cases, e.g. (demonstrating walrus use for 3.8+ for fun):

def checkIfProdExist(x):
    if m := re.search(fr'(?:^|\s)({re.escape(x)})(?=\s|$)', paragraph):
        return m.span(1)  # We're capturing match directly to get end of match easily, so we stop capturing leading space and just use span of capture
    # Implicitly returns falsy None on failure
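
A minimal sketch of how the span-returning version might be used on the two paragraphs from the question and its comments. The helper name `find_product` and passing the paragraph as an argument are assumptions made here so the example is self-contained; the pattern is the same as above:

```python
import re

def find_product(product, paragraph):
    # Hypothetical helper: same pattern as above, with the paragraph passed in
    m = re.search(fr'(?:^|\s)({re.escape(product)})(?=\s|$)', paragraph)
    return m.span(1) if m else None

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph_a = "Troubleshooting steps for productA v4.1.5 documents"
paragraph_b = "Troubleshooting steps for productA v4.1 documents and productA v4.1.5"

spans_a = {p: find_product(p, paragraph_a) for p in products}
spans_b = {p: find_product(p, paragraph_b) for p in products}

print(spans_a["productA v4.1.5"])  # (26, 41)
print(spans_b["productA v4.1"])    # (26, 39)

# A non-empty tuple is truthy, so the result also works as a filter condition
found_in_a = [p for p in products if find_product(p, paragraph_a)]
print(found_in_a)  # ['productA v4.1.5']
```

Returning a tuple rather than a bare index avoids the falsy-zero problem when the match is at the very start of the paragraph.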

Thank you for the quick reply. If my paragraph contains "product A v4.1.5 ver" (the 3rd value from the list), then splitting the line by whitespace will not work, right? It will just return "product A v4.1.5 ", not "product A v4.1.5 ver". – user3164444, Dec 10 '20 at 18:19
@user3164444: Ah, blech. Yeah, I'm a moron. Hold on, will "fix" (won't be a great fix; the regex solution is probably the way to go here). – ShadowRanger, Dec 10 '20 at 18:30
@user3164444: "Fixed". Like I said, as the problem grows sufficiently complex, simple string methods become less and less useful. You could always just do repeated `.find`/`.index` calls (at increasing offsets until a true hit or it returns `-1`/raises an exception) and manually test if the pattern is followed by whitespace or the end of the string (`after_found = paragraph[foundidx+len(x):foundidx+len(x)+1]`, then `if not after_found or after_found.isspace(): return foundidx`), but at that point you're writing a lot of very specific custom code; it's harder to maintain, and usually slower. – ShadowRanger, Dec 10 '20 at 18:45
Thank you @ShadowRanger! I solved my use case by reverse-sorting the products list and stripping the occurrences of each matched product from the paragraph. I posted the code showing how I did it. It may or may not be the right approach, but it served my purpose. I appreciate your research and help! – user3164444, Dec 11 '20 at 20:37
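
For reference, the repeated `.find` approach described in the comments above could look roughly like the following sketch. The function name `find_whole` is hypothetical, and the check on the character before the hit is an addition that the comment does not mention:

```python
def find_whole(paragraph, x):
    """Return the index of the first whitespace-delimited occurrence of x, or -1."""
    start = 0
    while True:
        idx = paragraph.find(x, start)
        if idx == -1:
            return -1  # no further candidates
        # One-character slices are empty at the string boundaries
        before = paragraph[idx - 1:idx] if idx > 0 else ""
        after = paragraph[idx + len(x):idx + len(x) + 1]
        if (not before or before.isspace()) and (not after or after.isspace()):
            return idx
        start = idx + 1  # retry at the next offset

text = "Troubleshooting steps for productA v4.1.5 documents and productA v4.1"
print(find_whole(text, "productA v4.1"))    # 56
print(find_whole(text, "productA v4.1.5"))  # 26
```

The first candidate for "productA v4.1" at index 26 is rejected because it is followed by ".", so the search continues to the later standalone occurrence.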

score 0 · Accepted Answer · answered Dec 11 '20 at 20:41

I solved my use case by reverse-sorting the products list (longest first) and stripping every occurrence of each matched product from the paragraph before looking for the shorter ones. The code is below. It may or may not be the right approach, but it served my purpose. It works even when the products list has n products and the paragraph contains many matching strings from the list. I appreciate all of your research and help!

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]

# Sort by length, longest first, so that longer strings are matched first
products = sorted(products, key=len, reverse=True)

paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "


def checkIfProdExist(x):
  if paragraph.find(x) != -1:
    return True
  else:
    return False

# Keep only the products that occur somewhere in the paragraph
prodResults = list(filter(checkIfProdExist, products))

print(prodResults)
# At this stage the result is ['productA v4.1.5 ver', 'productA v4.1.5', 'productA v4.1']

finalResult = []

# Loop through the matched strings, longest first
for prd in prodResults:
  if paragraph.find(prd) != -1:
    # Record the index of the first occurrence of this match
    finalResult.append({"index":str(paragraph.find(prd)),"value":prd})

    # Then remove all occurrences of the matched string so that shorter strings
    # cannot match inside it, e.g. removing "productA v4.1.5 ver" leaves nothing
    # for "productA v4.1.5" or "productA v4.1" to match there.
    # Note: later indexes therefore refer to the shortened paragraph, not the original.
    paragraph = paragraph.replace(prd,"")

print(finalResult)
# Final result is [{'index': '26', 'value': 'productA v4.1.5 ver'}, {'index': '56', 'value': 'productA v4.1'}]
# If the paragraph is "Troubleshooting steps for productA v4.1.5 documents" then the result is [{'index': '26', 'value': 'productA v4.1.5'}]
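
One thing to be aware of with the code above is that each index is measured in the paragraph after earlier matches have been removed. If indexes into the original paragraph are needed, a possible variant is to mask each match with same-length filler instead of deleting it. This is only a sketch: the function name `find_longest_first` and the use of "\0" as filler (assumed never to appear in real text) are choices made here, not part of the answer.

```python
def find_longest_first(products, paragraph):
    working = paragraph  # masked copy; the original paragraph is left untouched
    results = []
    for prd in sorted(products, key=len, reverse=True):
        idx = working.find(prd)
        if idx != -1:
            results.append({"index": idx, "value": prd})
            # Mask every occurrence with filler of equal length so that
            # offsets in `working` stay aligned with the original paragraph
            working = working.replace(prd, "\0" * len(prd))
    return results

products = ["productA v4.1", "productA v4.1.5", "productA v4.1.5 ver"]
paragraph = "Troubleshooting steps for productA v4.1.5 ver documents also has steps for productA v4.1 document "

final = find_longest_first(products, paragraph)
print(final)
# [{'index': 26, 'value': 'productA v4.1.5 ver'}, {'index': 75, 'value': 'productA v4.1'}]
```

Here `paragraph[75:88]` is "productA v4.1" in the unmodified text, whereas the deletion-based version reports 56 for the same match.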
