After lemmatizing a text I have a list of lemmas. For each element of this list I would like to figure out whether it is a word ("cat", "dog", "go", "red") or a non-word (".", "rand_yh4jhdf", "'''", "100x200", "42,44,46", "22:00", "xxx___BATMAN___xxx"). Does this problem have a simple solution? How can I distinguish words from non-words with Python and NLTK?

UPD (in reply to the question of what a word is): I want to clean total garbage out of my list, that is, remove whatever is clearly not a word and leave the complicated edge cases untouched.

Nik
  • How do you define a word? Is it something that exists in a dictionary? Or rather something that consists of letters and not numbers or punctuation? In the second case "jklsddjladajjlk" would be classified as a word. But in the first it wouldn't, as you can't find it in any dictionary. – pawelty Jan 18 '17 at 13:03
  • I don't have a strict definition of a word. A word is whatever an average person would consider a word. It looks like my problem has no simple solution. – Nik Jan 18 '17 at 13:06
  • If you expect English words, this solution may help you: http://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python. If you just want to check that a token has no digits or punctuation and does contain vowels (each word has vowels, right?), you could write simple tests for that (regex?). – pawelty Jan 18 '17 at 13:13

1 Answer

The following returns only the strings that consist entirely of letters, with no digits or punctuation:

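import re

test = [".","rand_yh4jhdf","''","100x200","42,44,46","22:00","xxx___BATMAN___xxx", "dog", "cat", "computer"]

# Keep only strings made up entirely of ASCII letters.
# "+" requires at least one letter; "*" would also accept the empty string.
words = [word for word in test if re.fullmatch(r"[a-zA-Z]+", word)]
print(words)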

output:

['dog', 'cat', 'computer']
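
The letters-only test still accepts strings such as "jklsddjladajjlk". As a rough sketch of the two ideas raised in the comments (require at least one vowel, and optionally look the token up in a word list), something like the following might work. The function name `looks_like_word`, the `vocabulary` parameter, and the choice to count "y" as a vowel are assumptions made for illustration, not part of NLTK:

```python
import re

# ASCII letters only, at least one character (an assumption for English text).
LETTERS_ONLY = re.compile(r"[A-Za-z]+")
# "y" is treated as a vowel here so that words like "my" pass (an assumption).
VOWELS = set("aeiouyAEIOUY")


def looks_like_word(token, vocabulary=None):
    """Heuristic filter; `vocabulary` is an optional set of known lowercase words."""
    if not LETTERS_ONLY.fullmatch(token):
        return False  # contains digits, punctuation, underscores, or is empty
    if not any(ch in VOWELS for ch in token):
        return False  # no vowel at all, e.g. "xkcd"
    if vocabulary is not None:
        return token.lower() in vocabulary  # strict dictionary lookup
    return True


lemmas = [".", "rand_yh4jhdf", "'''", "100x200", "22:00",
          "xxx___BATMAN___xxx", "cat", "go", "red", "xkcd", "jklsddjladajjlk"]

# Without a vocabulary, random letter strings that contain a vowel still pass.
print([t for t in lemmas if looks_like_word(t)])

# With a (tiny, hypothetical) vocabulary, only known words pass.
known = {"cat", "dog", "go", "red"}
print([t for t in lemmas if looks_like_word(t, known)])
```

The vowel test also rejects real tokens without vowels, such as "hmm", so it is only suitable for rough cleanup. For the dictionary variant, a set built from `nltk.corpus.words.words()` (after downloading the `words` corpus) could be passed as `vocabulary`; how well that list covers your lemmas would need checking.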
pawelty