I'm working on an NLP project in Python/NLTK with non-English Unicode text. For that, I need to search for a Unicode string inside a sentence.

I have a .txt file containing some non-English Unicode sentences. Using NLTK's PunktSentenceTokenizer, I split the text into sentences and stored them in a Python list.

from nltk.tokenize import PunktSentenceTokenizer
sentences = PunktSentenceTokenizer().tokenize(text)

Now I can iterate through the list and get each sentence separately.

What I need to do is go through each sentence and identify which words contain the given Unicode characters.

Example -

sentence = 'AASFG BBBSDC FEKGG SDFGF'

Assume the text above is non-English Unicode. I need to find words ending with GF and return the whole word (or maybe the index of that word).

search = 'SDFGF'

Similarly, I need to find words starting with BB and get the whole word.

search2 = 'BBBSDC'
Sukrit Kalra
ChamingaD

1 Answer

If I understand correctly, you just have to split the sentence into words, loop over them, and check whether each one ends or starts with the required characters, e.g.:

>>> sentence = 'AASFG BBBSDC FEKGG SDFGF'.split()
>>> sentence
['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
>>> [word for word in sentence if word.endswith("GF")]
['SDFGF']

The str.split() call could probably be replaced with something like nltk.tokenize.word_tokenize(text), where text is the sentence string.
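
The same check can be wrapped in a small helper that also returns each word's position and covers the starts-with case. This is only a minimal sketch, assuming Python 3 (where str is already Unicode) and plain whitespace splitting; the name find_words and the Russian sample sentence are made up for illustration and are not from the question:

```python
def find_words(sentence, prefix=None, suffix=None):
    """Return (index, word) pairs for words matching the given prefix and/or suffix."""
    matches = []
    for idx, word in enumerate(sentence.split()):
        if prefix is not None and not word.startswith(prefix):
            continue
        if suffix is not None and not word.endswith(suffix):
            continue
        matches.append((idx, word))
    return matches


# The ASCII example from the question.
print(find_words('AASFG BBBSDC FEKGG SDFGF', suffix='GF'))  # [(3, 'SDFGF')]
print(find_words('AASFG BBBSDC FEKGG SDFGF', prefix='BB'))  # [(1, 'BBBSDC')]

# Hypothetical non-English sample, only to show that non-ASCII text works the same way.
sample = 'привет мир приветствие'
print(find_words(sample, prefix='при'))
```

If the text is read from a file, it probably needs to be opened with the right encoding (for example open(path, encoding='utf-8')) so that the comparison happens on decoded text rather than raw bytes.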

Update, regarding comment:

How can I get the word in front of it and the word behind it?

The enumerate function can be used to give each word a number, like this:

>>> print(list(enumerate(sentence)))
[(0, 'AASFG'), (1, 'BBBSDC'), (2, 'FEKGG'), (3, 'SDFGF')]

Then if you do the same loop, but preserve the index:

>>> results = [(idx, word) for idx, word in enumerate(sentence) if word.endswith("GG")]
>>> print(results)
[(2, 'FEKGG')]

..you can use the index to get the next or previous item:

>>> for r in results:
...     r_idx = r[0]  # position of the matched word
...     print("Prev " + sentence[r_idx - 1])
...     print("Next " + sentence[r_idx + 1])
...
Prev BBBSDC
Next SDFGF

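You'd need to handle the case where the match is the very first or last word (if r_idx == 0, or if r_idx == len(sentence) - 1). Otherwise sentence[r_idx - 1] silently wraps around to the last word when r_idx is 0, and sentence[r_idx + 1] raises an IndexError when the match is the last word.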
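
One possible way to handle those edge cases is to return None when there is no neighbor on that side. A minimal sketch, assuming Python 3; the helper name neighbors is hypothetical:

```python
def neighbors(words, idx):
    """Return (previous, next) words around position idx, with None at the edges."""
    prev_word = words[idx - 1] if idx > 0 else None
    next_word = words[idx + 1] if idx < len(words) - 1 else None
    return prev_word, next_word


words = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
for idx, word in enumerate(words):
    if word.endswith("GG"):
        print(word, neighbors(words, idx))  # FEKGG ('BBBSDC', 'SDFGF')
```

Returning None rather than raising keeps the calling loop simple, at the cost that the caller has to check for None before using the result.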

dbr
  • Now I have another question. With the code above I can find words ending or starting with given letters. How can I get the word in front of a match and the word behind it? For example, if I search for GG and get FEKGG, I then need BBBSDC as the word in front and SDFGF as the word behind. – ChamingaD Aug 11 '13 at 10:49