My main string is stored in a DataFrame and the substrings are stored in a list. I want to find which substrings match. Here is the code I am using:

import pandas as pd
from textblob import TextBlob

sentence2 = "Previous study: 03/03/2018 (other hospital)  Findings:   Lung parenchyma: The study reveals evidence of apicoposterior segmentectomy of LUL showing soft tissue thickening adjacent surgical bed at LUL, possibly post operation."
blob_sentence = TextBlob(sentence2)
noun = blob_sentence.noun_phrases
df1 = pd.DataFrame(noun)
comorbidity_keywords = ["segmentectomy","lobectomy"]
matches =[]
for comorbidity_keywords[0] in df1:
    if comorbidity_keywords[0] in df1 and comorbidity_keywords[0] not in matches:
       matches.append(comorbidity_keywords)

This returns a string that is not an actual match. The output should be "segmentectomy", but I get [0,'lobectomy']. I tried to follow the answer posted in "Check if multiple strings exist in another string". What am I doing incorrectly?

Yuan JI
khushbu
  • Begin with fixing `for comorbidity_keywords[0] in df1:` - you're essentially iterating over your `DataFrame` storing each row as the first element of your `comorbidity_keywords` list. Replace that line with something like `for keyword in comorbidity_keywords:` and then use `keyword` instead of `comorbidity_keywords[0]` in your `if...` check. – zwer Mar 10 '19 at 08:53
  • @zwer I edited it like this: `matches = []`, then `for keyword in comorbidity_keywords:`, `if keyword in df1 and keyword not in matches:`, `matches.append(keyword)`. But this gives empty results. – khushbu Mar 10 '19 at 10:22
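
The empty result in the last comment follows from how `in` behaves on a DataFrame: it tests the column labels, not the cell values. A minimal sketch, assuming `df1` has a single column labelled `0` (which is what `pd.DataFrame(noun)` produces); the two phrases are made-up stand-ins for the TextBlob output:

```python
import pandas as pd

# Hypothetical stand-in for blob_sentence.noun_phrases
noun = ["apicoposterior segmentectomy", "soft tissue"]
df1 = pd.DataFrame(noun)

# `in` on a DataFrame checks column labels only
label_hit = 0 in df1                    # True: 0 is the column label
value_hit = "segmentectomy" in df1      # False: it is a cell value, not a label

# Iterating over a DataFrame also yields column labels, not rows
labels = list(df1)

# To test the cell contents, look at the values themselves
cell_hit = any("segmentectomy" in phrase for phrase in df1[0])
```

Iterating over column labels is also why the original loop stored `0` in the first slot of `comorbidity_keywords`.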

2 Answers

I don't really use TextBlob, but here are two functions that might help you reach your goal. Essentially, I split the sentence on spaces and iterate over the resulting words to see if any of them match a keyword. One function returns a list of matches and the other returns a dictionary mapping each match's word index to the word.

### If you just want a list of words
def find_keyword_matches(sentence, keyword_list):
    s1 = sentence.split(' ')
    return [i for i in  s1 if i in keyword_list]

Then:

find_keyword_matches(sentence2, comorbidity_keywords)

Output:

['segmentectomy']
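
Splitting on a single space only matches tokens that equal a keyword exactly, so a keyword followed by punctuation (for example `segmentectomy,`) would be missed. A possible variant, sketched here with the standard `re` module and a hypothetical function name `find_keyword_matches_regex`, matches whole words regardless of surrounding punctuation or letter case:

```python
import re

def find_keyword_matches_regex(sentence, keyword_list):
    """Return the keywords that occur as whole words in sentence (case-insensitive)."""
    found = []
    for keyword in keyword_list:
        # \b anchors the match at word boundaries; re.escape guards special characters
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(keyword)
    return found

# Made-up example sentence with punctuation and mixed case
example = "Status post segmentectomy, no evidence of prior Lobectomy."
result = find_keyword_matches_regex(example, ["segmentectomy", "lobectomy", "biopsy"])
```

Here `result` keeps the keywords in the order they appear in `keyword_list`, not in sentence order.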

For a dictionary:

def find_keyword_matches(sentence, keyword_list):
    s1 = sentence.split(' ')
    # map the position of each matching word to the word itself
    return {s1.index(i): i for i in s1 if i in keyword_list}

Output:

{17: 'segmentectomy'}
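
Note that `list.index` returns only the first position of a value, so a keyword that appears more than once would collapse into a single dictionary entry. A small sketch using `enumerate` (the name `find_keyword_positions` is illustrative, not part of the answer) records every position instead:

```python
def find_keyword_positions(sentence, keyword_list):
    """Map each token index to the keyword found at that position."""
    tokens = sentence.split(' ')
    return {idx: tok for idx, tok in enumerate(tokens) if tok in keyword_list}

# Made-up sentence in which one keyword occurs twice
positions = find_keyword_positions(
    "lobectomy then segmentectomy then lobectomy",
    ["segmentectomy", "lobectomy"],
)
```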

Finally, a function that also prints where in the sentence a keyword is found, if at all. The range is in character offsets, not word positions:

def word_range(sentence, keyword):
    try:
        # str.index raises ValueError when the keyword is absent
        idx_start = sentence.index(keyword)
        idx_end = idx_start + len(keyword)
        print(f"Word '{keyword}' found within index range {idx_start} to {idx_end}")
        return keyword
    except ValueError:
        # keyword not found: fall through and return None
        return None

Then use a nested list comprehension to drop the None values returned for keywords that were not found:

found_words = [x for x in [word_range(sentence2, i) for i in comorbidity_keywords] if x is not None]
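
`str.index` reports only the first occurrence of each keyword. If every occurrence matters, one option is `re.finditer`; the sketch below uses a hypothetical helper `keyword_spans` and returns the character ranges instead of printing them:

```python
import re

def keyword_spans(sentence, keyword_list):
    """Return {keyword: [(start, end), ...]} for every occurrence in sentence."""
    spans = {}
    for keyword in keyword_list:
        # re.escape treats the keyword as literal text rather than a pattern
        hits = [m.span() for m in re.finditer(re.escape(keyword), sentence)]
        if hits:
            spans[keyword] = hits
    return spans

# Made-up sentence containing the same keyword twice
spans = keyword_spans(
    "LUL segmentectomy; prior segmentectomy",
    ["segmentectomy", "lobectomy"],
)
```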
Mark Moretto

There is probably a more efficient way to do this, but this is what I came up with, using two nested for loops over the two lists.

matches = []
for ckeyword in comorbidity_keywords:
    # each row of df1 becomes a list of cell values
    for keyword in df1.values.tolist():
        if any(ckeyword in key for key in keyword):
            matches.append(ckeyword)
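
One possible vectorized alternative, assuming `df1` holds the noun phrases in a single column labelled `0`, is pandas' `Series.str.contains`. The phrases below are made-up stand-ins for the TextBlob output:

```python
import pandas as pd

# Hypothetical noun phrases standing in for blob_sentence.noun_phrases
df1 = pd.DataFrame(["apicoposterior segmentectomy", "soft tissue", "surgical bed"])
comorbidity_keywords = ["segmentectomy", "lobectomy"]

# Keep each keyword that occurs as a plain substring in at least one phrase
matches = [
    kw for kw in comorbidity_keywords
    if df1[0].str.contains(kw, regex=False).any()
]
```

Unlike the nested loops, this adds each keyword at most once, even when it appears in several phrases.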
khushbu