I'm working on an NLP project based on Python/NLTK with non-English Unicode text. For that, I need to search for a Unicode string inside a sentence.
There is a .txt file saved with some non-English Unicode sentences. Using NLTK's PunktSentenceTokenizer I split them into sentences and saved them in a Python list:

from nltk.tokenize import PunktSentenceTokenizer

sentences = PunktSentenceTokenizer().tokenize(text)
Now I can iterate through the list and get each sentence separately. What I need to do is go through each sentence and identify which word contains the given Unicode characters.
Example -
sentence = 'AASFG BBBSDC FEKGG SDFGF'
Assume the above text is non-English Unicode and I need to find words ending with GF, then return the whole word (or maybe the index of that word).
search = 'SDFGF'
Similarly, I need to find words starting with BB and get that word.
search2 = 'BBBSDC'
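
Conceptually I'm after something like this rough sketch (the whitespace split and the suffix/prefix values 'GF' and 'BB' are just assumptions taken from the example above):

suffix = 'GF'
prefix = 'BB'

for sentence in sentences:
    words = sentence.split()  # or nltk.word_tokenize(sentence)
    for index, word in enumerate(words):
        if word.endswith(suffix):    # should match 'SDFGF'
            print(index, word)
        if word.startswith(prefix):  # should match 'BBBSDC'
            print(index, word)

Is str.endswith / str.startswith reliable on non-English Unicode words, or is there a better way to do this with NLTK?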