7

I have many strings like the following:

  1. ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
  2. KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
  3. ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

I am using NLTK and want to strip the dateline from each string and recognize the date, location, and person names in it.

With POS tagging I can find the parts of speech, but I also need to identify the location, date, and person name. How can I do that?

Update:

Note: I don't want to make another HTTP request. I need to parse the text with my own code. If there is a library for this, it's fine to use it.

Update:

I tried ne_chunk, but with no luck.

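import nltk

# The three sample strings from above, truncated as in the source data.
txts = [
    "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
    "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
    "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
]

def pchunk(t):
    w_tokens = nltk.word_tokenize(t)   # split the string into word tokens
    pt = nltk.pos_tag(w_tokens)        # tag each token with its part of speech
    ne = nltk.ne_chunk(pt)             # group tagged tokens into named-entity chunks
    print(ne)

for t in txts:
    print(t)
    pchunk(t)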

The output is as follows:

ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab

(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))

KARACHI, July 24 -- Police claimed to have arrested several suspects in separate

(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)

ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin

(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)

Look carefully: KARACHI is recognized correctly as a GPE, but Sri Lanka is labeled PERSON, and ISLAMABAD is only tagged NNP rather than recognized as a GPE.

Shiplu Mokaddim
  • ISLAMABAD is not recognized; it is tagged as NNP, not GPE – Spaceghost Feb 07 '14 at 02:27
  • In your examples, the locations (and the one date) showed up in the beginning of the string. Also there was a delimiter before the rest of the news story began. Is this a pattern in the rest of your data? – Frank T Feb 12 '14 at 14:08
  • @FrankT The pattern is not consistent. Different providers use different delimiters, and even the same provider does not always use the same one: sometimes `--`, sometimes just a `.` or `-`. It's possible to apply a regular expression, but a regex does not recognize names; it only works on characters – Shiplu Mokaddim Feb 12 '14 at 15:40

2 Answers

2

If using an API rather than your own code is acceptable for your requirements, this is something the Wit API can easily do for you.

Wit will also resolve date/time tokens into normalized dates.

To get started you just have to provide a few examples.

Blacksad
  • I need to do it with my own code. The texts come from different Internet sources. I don't want to make another HTTP request that slows down the aggregation; it would also make the aggregation dependent on external services. Is there any way to do it using my own code? – Shiplu Mokaddim Feb 05 '14 at 05:27
  • In this case, check out the NER (Named Entity Recognition) module in NLTK. It can recognize dates, persons, and locations for you. – Blacksad Feb 05 '14 at 05:30
  • 7
    **This is not the answer**. I don't want to depend on an external service – Shiplu Mokaddim Feb 13 '14 at 10:02
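On the NLTK NER route mentioned in the comments: the question's output shows the all-caps ISLAMABAD tagged only as NNP. One thing that may be worth trying is converting the leading all-caps dateline words to title case before tagging. This is a minimal sketch, assuming the dateline is a leading run of all-caps words; `normalize_dateline_case` is a hypothetical helper name, and whether NLTK then labels the place as a GPE is not verified here.

```python
import re

# Assumption: the dateline place is a leading run of all-caps words
# (two or more letters each), as in the three samples in the question.
LEADING_CAPS = re.compile(r"^(?:[A-Z]{2,}\s+)*[A-Z]{2,}\b")


def normalize_dateline_case(text):
    """Title-case the leading all-caps words; leave the rest of the text as is."""
    m = LEADING_CAPS.match(text)
    if m is None:
        return text  # no leading all-caps run, nothing to change
    head = " ".join(word.capitalize() for word in m.group(0).split())
    return head + text[m.end():]


if __name__ == "__main__":
    print(normalize_dateline_case("ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said"))
    print(normalize_dateline_case("ALUM KULAM, Sri Lanka -- As gray-bellied clouds started"))
```

The result can then be passed to `nltk.word_tokenize`, `nltk.pos_tag`, and `nltk.ne_chunk` exactly as in the question. Only the leading run is changed, so acronyms later in the text keep their capitalization.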
1

Yahoo has a PlaceFinder API that should help with identifying places. It looks like the places are always at the start, so it could be worth taking the first couple of words and sending them to the API until it hits a limit:

http://developer.yahoo.com/boss/geo/

It may also be worth using the dreaded regex to identify capitals; see "Regular expression for checking if capital letters are found consecutively in a string?"
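A minimal sketch of the regex idea, assuming datelines look like the three samples in the question: an all-caps place, an optional ", region" or ", Month Day" part, then one of the delimiters `:`, `--`, `-`, or `.`. The pattern and the `split_dateline` helper are illustrative names, not a general solution; other providers' formats would need their own rules.

```python
import re

# Assumed dateline shape, based only on the samples in the question.
DATELINE = re.compile(
    r"""^\s*
    (?P<place>[A-Z]{2,}(?:\s+[A-Z]{2,})*)   # all-caps place: ISLAMABAD, ALUM KULAM
    (?:,\s*(?P<rest>[^\n]{1,40}?))?         # optional part after a comma (date or region)
    \s*(?:--|:|-|\.)\s+                     # one of the delimiters mentioned in the comments
    (?P<body>.*)$""",
    re.VERBOSE | re.DOTALL,
)

# Assumed date shape: English month name (or abbreviation) plus day number.
DATE = re.compile(
    r"^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}$"
)


def split_dateline(text):
    """Return place, region, date, and body, or None if no dateline is found."""
    m = DATELINE.match(text)
    if m is None:
        return None
    date = region = None
    for part in (m.group("rest") or "").split(","):
        part = part.strip()
        if not part:
            continue
        if DATE.match(part):
            date = part
        else:
            region = part
    return {
        "place": m.group("place"),
        "region": region,
        "date": date,
        "body": m.group("body"),
    }


if __name__ == "__main__":
    for s in [
        "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
        "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
        "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
    ]:
        print(split_dateline(s))
```

This only separates the dateline from the body by its shape; it does not know whether "Sri Lanka" is really a place, so person names in the body would still need a named-entity recognizer.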

Good luck!

Community
Malcolm Murdoch