I think I may be fundamentally confused about something in Python or NLTK. I'm generating a list of tokens from a paper abstract and trying to check whether a search word appears among the tokens. I do know about concordance, but it doesn't fit my intended use of the comparison.

Here is my code:

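import nltk

def tokenize(text):
    # text must provide a get_text() method that returns the abstract as a string
    tokens = nltk.word_tokenize(text.get_text())
    return tokens

def search_abstract_single_word(tokens, keyword):
    # count the tokens that are exactly equal to keyword
    match = 0
    for token in tokens:
        if token == keyword:
            match += 1
    return match

def search_file_single_word(abstract_list, keyword):
    # one match count per abstract, in the same order as abstract_list
    matches = list()
    for item in abstract_list:
        tokens = tokenize(item)
        match = search_abstract_single_word(tokens, keyword)
        matches.append(match)
    return matches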

I've confirmed that the tokens and keyword being passed in are correct, but match (and thus the entire list of matches) always evaluates to zero. My understanding was that word_tokenize returns a list of strings, so I don't see why, for example, when token = computer and keyword = computer, token == keyword does not return True and increment match.

EDIT: In a standalone class/main method this code does appear to work. However, the code is being called from a tkinter window like so:

self.keywords = ""
....
self.keywords_box = Text(self.Frame2)
....
self.Submit = Button(master)
self.Submit.configure(command=self.submit)
....
#triggered by submit button
def submit(self):
    self.keywords += self.keywords_box.get("1.0", END)

#triggered by run button after keyword saved
def run(self):
    search_input = self.keywords
    ....
    #use pandas to read excel file, create abstracts, and store
    ....
    matches = search_file_single_word(abstract_list, search_input)
    for match in matches:
        self.output_box.insert(END, match)
        self.output_box.insert(END, '\n')

I had assumed that, because print(keyword) showed the expected value when I inserted it into search_file_single_word, the value was being passed correctly. Is it actually just passing the tkinter property along and refusing to compare it against the token?

Harry F
  • Is the keyword you are looking for already tokenized? – J.Zagdoun Jul 12 '18 at 15:35
  • It seems to work to me: https://pastebin.com/0xX7q0am – Neil Jul 12 '18 at 15:38
  • @J.Zagdoun I tried tokenizing the keyword, it did not make a difference. – Harry F Jul 12 '18 at 17:14
  • @Neil Could it work by itself but not for me because the keyword passed is a tkinter window property? – Harry F Jul 12 '18 at 17:16
  • Wait, what do you mean by tokenizing the keyword? If you tokenize the keyword then you're comparing a list to a string... – Neil Jul 12 '18 at 19:19
  • @Neil I had simply run tokenize on the keyword, which would just return a single token (I think), after JZ asked. It didn't change anything, so I reverted it. It was related to the GUI after all, though. After re-reading some pages on the text box get(), I found that reading up to END leaves a newline character at the end of the keyword, which was invisible in my test prints but would in fact cause an inequality. I changed it to 'end - 1c', following another Stack Exchange post, and it runs fine now. – Harry F Jul 12 '18 at 21:07

1 Answer

Moral of the story: be careful with options. Calling textbox.get("1.0", END) returns the text with a trailing newline character, and "string" != "string\n", so the keyword never equals any token. Solution found in answer to this post
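For anyone hitting the same thing, here is a minimal sketch of the problem and the fix without the GUI. The names raw_keyword, cleaned_keyword and count_matches are made up for illustration, the token list is hard-coded rather than produced by word_tokenize, and raw_keyword just stands in for what the text box returned in my case. In the actual tkinter code, the equivalent change was reading the box with "end-1c" instead of END; stripping the string afterwards should work as well.

```python
# Stand-in for the value read from the Text widget with get("1.0", END):
# the typed word followed by a trailing newline.
raw_keyword = "computer\n"

# Hard-coded tokens (an assumption for this sketch, not real tokenizer output).
tokens = ["A", "computer", "is", "a", "machine", "."]


def count_matches(tokens, keyword):
    """Count the tokens that are exactly equal to keyword."""
    return sum(1 for token in tokens if token == keyword)


# print() hides the problem; repr() makes the trailing newline visible.
print(repr(raw_keyword))

# Remove surrounding whitespace, including the trailing newline.
cleaned_keyword = raw_keyword.strip()

before = count_matches(tokens, raw_keyword)      # keyword still ends in "\n"
after = count_matches(tokens, cleaned_keyword)   # keyword is the bare word
print(before, after)
```

Printing repr(keyword) instead of keyword during debugging would have shown the stray newline straight away.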

Harry F