I have lots of strings like following,
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
I am using NLTK to remove the dateline part and recognize the date, location and person name?
Using pos tagging I can find the parts of speech. But I need to determine location, date, person name. How can I do that?
Update:
Note: I dont want to perform another http request. I need to parse it using my own code. If there is a library its okay to use it.
Update:
I use ne_chunk
. But no luck.
import nltk
def pchunk(t):
w_tokens = nltk.word_tokenize(t)
pt = nltk.pos_tag(w_tokens)
ne = nltk.ne_chunk(pt)
print ne
# txts is a list of those 3 sentences.
for t in txts:
print t
pchunk(t)
Output is following,
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
(S
ISLAMABAD/NNP
:/:
Chief/NNP
Justice/NNP
(PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
said/VBD
that/IN
(ORGANIZATION National/NNP Accountab/NNP))
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
(S
(GPE KARACHI/NNP)
,/,
July/NNP
24/CD
--/:
Police/NNP
claimed/VBD
to/TO
have/VB
arrested/VBN
several/JJ
suspects/NNS
in/IN
separate/JJ)
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
(S
(GPE ALUM/NN)
(ORGANIZATION KULAM/NN)
,/,
(PERSON Sri/NNP Lanka/NNP)
--/:
As/IN
gray-bellied/JJ
clouds/NNS
started/VBN
to/TO
blot/VB
out/RP
the/DT
scorchin/NN)
Check carefully. Even KARACHI is recognized very well, but Sri Lanka is recognized as Person and ISLAMABAD is recognized as NNP not GPE.