-1

I am trying to look for keywords in sentences, which are stored as a list of lists. The outer list contains the sentences, and each inner list contains the words of one sentence. I want to iterate over each word in each sentence, look for the defined keywords, and return the matches where they are found.

This is what my tokens_sentences looks like. (Screenshot not included; the first two rows are shown further down.)

I followed this post: "How to iterate through a list of lists in python?" However, I am getting an empty list in return.

This is the code I have written.

 import nltk
 from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize

 text = "MDCT SCAN OF THE CHEST:     HISTORY: Follow-up LUL nodule.   TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm.   COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015.     FINDINGS:   Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."  

 tokenizer_words = TweetTokenizer()
 tokens_sentences = [tokenizer_words.tokenize(t) for t in 
 nltk.sent_tokenize(text)]

 nodule_keywords = ["nodules","nodule"]
 count_nodule =[]
 def GetNodule(sentence, keyword_list):
     s1 = sentence.split(' ')
     return [i for i in  s1 if i in keyword_list]

 for sub_list in tokens_sentences:
     result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
     count_nodule.append(result_calcified_nod)

However, I am getting an empty list as the result in the variable count_nodule.

These are the first two rows of tokens_sentences:

tokens_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'], ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]

Please help me figure out what I am doing wrong!

khushbu

2 Answers

2

The error is here:

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)

You are looping over each sub_list in tokens_sentences, but only passing the first word sub_list[0] to GetNodule.
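To see this in isolation, here is a small sketch that reproduces the problem without NLTK. It reuses the two sample rows from the question and the original GetNodule unchanged; the name first_words is only introduced here for illustration.

```python
# The two sample rows posted in the question.
tokens_sentences = [
    ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
     'Follow-up', 'LUL', 'nodule', '.'],
    ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT',
     'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of',
     '2', 'mm', '.'],
]
nodule_keywords = ["nodules", "nodule"]


def GetNodule(sentence, keyword_list):
    # Original function from the question: it expects a string and splits it.
    s1 = sentence.split(' ')
    return [i for i in s1 if i in keyword_list]


# What the loop actually hands to GetNodule: only the first token per sentence.
first_words = [sub_list[0] for sub_list in tokens_sentences]
print(first_words)  # ['MDCT', 'TECHNIQUES']

count_nodule = []
for sub_list in tokens_sentences:
    count_nodule.append(GetNodule(sub_list[0], nodule_keywords))
print(count_nodule)  # [[], []]
```

Neither 'MDCT' nor 'TECHNIQUES' is a keyword, so every per-sentence result is empty.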

This type of error is fairly common and somewhat hard to catch: Python code that expects a list of strings will happily accept a single string instead and iterate over its individual characters if you call it incorrectly. If you want to be defensive, it may be a good idea to add something like

assert not all(len(x)==1 for x in sentence)
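A possible alternative, if you prefer an explicit type check, is sketched below. The function name get_nodule_checked and the error message are made up for this example and are not part of the question's code. One thing to be aware of with the one-line assert above: it also fires for an empty sentence, and for a sentence made up only of single-character tokens, because all(...) is True in both cases.

```python
def get_nodule_checked(sentence, keyword_list):
    # Hypothetical variant: reject a bare string explicitly, since iterating
    # over a string would silently yield single characters instead of words.
    if isinstance(sentence, str):
        raise TypeError("expected a list of tokens, got a single string")
    return [w for w in sentence if w in keyword_list]


nodule_keywords = ["nodules", "nodule"]

# A string iterates character by character, which is the silent failure mode.
print(list("nodule"))  # ['n', 'o', 'd', 'u', 'l', 'e']

# A tokenized sentence works as intended.
print(get_nodule_checked(['Follow-up', 'LUL', 'nodule', '.'], nodule_keywords))

# A single word passed by mistake is reported instead of returning [].
try:
    get_nodule_checked("MDCT", nodule_keywords)
except TypeError as exc:
    print("rejected:", exc)
```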

And of course, as @dyz notes in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just loop over the sentence.

return [w for w in sentence if w in keyword_list]

As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
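The difference is easy to see on a toy example. The per-sentence results below are made-up values, not output from the question's text.

```python
# Made-up per-sentence results, standing in for what GetNodule returns.
per_sentence_results = [['nodule'], [], ['nodules', 'nodule']]

appended = []
extended = []
for result in per_sentence_results:
    appended.append(result)   # adds the list itself as one element
    extended.extend(result)   # adds each word of the list individually

print(appended)  # [['nodule'], [], ['nodules', 'nodule']]
print(extended)  # ['nodule', 'nodules', 'nodule']
```

With append you keep one (possibly empty) list per sentence; with extend you get a single flat list of hits, which is probably what you want if the goal is to count them.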

tripleee
2
  1. You need to remove s1 = sentence.split(' ') from GetNodule because sentence has already been tokenized (it is already a list).

  2. Remove the [0] from GetNodule(sub_list[0], nodule_keywords). Not sure why you would want to pass the first word of each sentence into GetNodule!
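With both changes applied, the code could look roughly like this. It is a sketch that uses the two sample rows from the question in place of the NLTK tokenization step, so it runs on its own.

```python
# The two sample rows posted in the question (stand-in for the NLTK output).
tokens_sentences = [
    ['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':',
     'Follow-up', 'LUL', 'nodule', '.'],
    ['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT',
     'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of',
     '2', 'mm', '.'],
]
nodule_keywords = ["nodules", "nodule"]


def GetNodule(sentence, keyword_list):
    # Change 1: sentence is already a list of tokens, so no split is needed.
    return [w for w in sentence if w in keyword_list]


count_nodule = []
for sub_list in tokens_sentences:
    # Change 2: pass the whole tokenized sentence, not just its first word.
    count_nodule.append(GetNodule(sub_list, nodule_keywords))

print(count_nodule)  # [['nodule'], []]
```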

dyz