
I am trying to extract keywords from a piece of text using nltk and the Stanford NLP tools. After I run my code, I get a list like this:

companyA
companyB
companyC
Trend Analysis For companyA

This is all good, but notice the last item. That is actually a heading that appears in the text. Since all the words in a heading are capitalized, my program thinks that they are all proper nouns and thus clubs them together as if they were one big company name.

The good thing is that as long as a company has been mentioned somewhere in the text, my program will pick it up, hence I get individual items like companyA as well. These are coming from the actual piece of text that talks about that company.

Here is what I want to do.

In the list that I get above, is there a way to look at an item and determine if any previous items are a substring of the current one? For example, in this case when I come across

Trend Analysis For companyA

I can check whether I have seen any part of this before. So I can determine that I already have companyA and thus I will ignore Trend Analysis For companyA. I am confident that the text will mention any companies enough times for StanfordNER to pick it up. Thus I do not have to rely on headings to get what I need.

Does that make sense? Is this the correct approach? I am afraid that this will not be very efficient, but I can't think of anything else.
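For concreteness, here is the brute-force check I have in mind as a minimal sketch (the function name and sample list are made up):

def filter_headings(candidates):
    # Keep a candidate only if no previously kept name occurs inside it,
    # e.g. "Trend Analysis For companyA" is dropped because "companyA"
    # was already seen on its own.
    kept = []
    for name in candidates:
        if not any(prev in name for prev in kept):
            kept.append(name)
    return kept

print(filter_headings(["companyA", "companyB", "companyC",
                       "Trend Analysis For companyA"]))
# ['companyA', 'companyB', 'companyC']

This is quadratic in the number of candidates, which is what worries me efficiency-wise.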

Edit

Here is the code that I use:

import nltk

sentences = nltk.sent_tokenize(document)                      # split into sentences
tokenized = [nltk.word_tokenize(sent) for sent in sentences]  # tokenize each sentence
tagged = [nltk.pos_tag(sent) for sent in tokenized]           # POS-tag each token list

After that, I simply use the StanfordNERTagger on each tokenized sentence:

from itertools import groupby
from nltk.tag import StanfordNERTagger

result = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
for s in tokenized:  # the NER tagger wants plain tokens, not (word, POS) pairs
    tagged_words = stn.tag(s)  # [(word, entity_label), ...]
    for tag, chunk in groupby(tagged_words, lambda x: x[1]):
        if tag == "ORGANIZATION":
            # join consecutive ORGANIZATION tokens into one name
            result.append((tag, " ".join(w for w, t in chunk)))

In this way I can get all the ORGANIZATIONs.

To @Alvas's point about truecasing, don't you think it's a bit of an overkill here? When I studied the algorithm, it appeared to me that they are trying to come up with the most likely spelling for each word, with the likelihood based on a corpus. I don't think I will need to build a corpus, as I can use a dictionary like WordNet or something like pyenchant to figure out the appropriate spelling. Also, here I already have all the information I need, i.e. I am picking up all the companies mentioned.
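Here is a minimal sketch of the kind of dictionary lookup I have in mind, using pyenchant (the helper name and the heuristic are mine, and deliberately simplistic):

import enchant  # pip install pyenchant

d = enchant.Dict("en_US")

def looks_like_heading_case(token):
    # If the lowercased form is an ordinary dictionary word, the
    # capital letter probably just comes from the heading style.
    return token.istitle() and d.check(token.lower())

print(looks_like_heading_case("Trend"))     # True
print(looks_like_heading_case("companyA"))  # False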

There is another problem. Consider the company name

American Eagle Outfitters

Note that American and american can both be proper spellings, depending on context; similarly for Eagle and eagle. I am afraid that even if I incorporate truecasing into my algorithm, it will end up lowercasing terms that should not be lowercased.

Again, my problem right now is that I have all the company names extracted, but I am also extracting the headings. The brute-force way would be to perform a substring check on the list of results. I was just wondering whether there is a more efficient way of doing this. Moreover, I don't think that any tweaking I do will improve the tagging; I don't think I will be able to outperform the StanfordNERTagger.

AbtPst
  • Once again, interesting but not appropriate for Stackoverflow. How did you extract the text using `NLTK` + `Stanford NLP`? Can you post a link to the code or provide a code snippet? Depending on how you extracted the phrase, you might not be extracting what you really wanted. – alvas Jan 19 '16 at 15:54
  • Did you ever try the `truecasing` suggestion from http://stackoverflow.com/questions/34439208/nltk-stanfordnertagger-how-to-get-proper-nouns-without-capitalization/34458164#34458164? I'm pretty sure truecasing will cause you less pain on NER in general. I'm going to vote this as a duplicate because now you have the reverse problem of "false positive" with caps instead of "false negative" without caps. Possible solutions remain the same, e.g. do `truecasing` – alvas Jan 19 '16 at 15:55
  • Read section 3.4.1 from https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf – alvas Jan 19 '16 at 15:59
  • @GaborAngeli, he was using caseless model and http://stackoverflow.com/questions/34439208/nltk-stanfordnertagger-how-to-get-proper-nouns-without-capitalization/34458164#34458164 . I think the more important thing is to understand the problem, understand the existing technologies before leaping into implementing or executing one. – alvas Jan 19 '16 at 16:03
  • please read up on http://stackoverflow.com/questions/1832076/what-is-the-difference-between-supervised-learning-and-unsupervised-learning then http://stats.stackexchange.com/questions/897/online-vs-offline-learning and some `nlp` and `machine-learning` introduction: http://stackoverflow.com/questions/34791491/where-to-start-natural-language-processing-and-ai-using-python/34791965#34791965 – alvas Jan 19 '16 at 19:02
  • The system/model will be probabilistic but it's immutable. Seeing a named entity many times in testing will not change how the model will tag the entity, you would have to retrain it with annotated data in order for the model to relearn how to tag the text type you need, i.e. the 2nd solution from http://stackoverflow.com/questions/34791491/where-to-start-natural-language-processing-and-ai-using-python/34791965#34791965 (annotate a dataset and retrain the system based on the annotations instead of using the off-the-shelf models from Stanford NLP). – alvas Jan 19 '16 at 19:05
  • Thanks @Alvas for the good suggestions, as always. Please see the edit; I have tried to explain a few other problems. – AbtPst Jan 20 '16 at 08:06
  • Truecasing is not an overkill. See `gensim` creator's comment https://twitter.com/RadimRehurek/status/689624293794181120 . In fact, Truecasing is an essential part of any good NLP systems. It's a cheap step to truecase, see https://github.com/moses-smt/mosesdecoder/tree/master/scripts/recaser. In fact it might be cheaper than tokenization... – alvas Jan 20 '16 at 10:44
  • Absolutely, but since I am getting all the companies already, do I really need to truecase just to tackle the headings? And what about things like `Eagle` and `eagle`, where both casings are correct depending on the context? I will need to read a bit more to find out if I can incorporate context into truecasing. Do you think it can be done? – AbtPst Jan 20 '16 at 12:34
  • http://www.urbandictionary.com/define.php?term=the+proof+is+in+the+pudding – alvas Jan 20 '16 at 16:42

1 Answer


I encountered a similar problem, but there the whole text was uncapitalized (ASR output). In that case I retrained the NER model on uncapitalized annotated data to obtain better performance.

Here are a few options I would consider, in order of preference (and complexity):

  • Uncapitalize the text: after tokenization / sentence splitting, try to guess which sentences are all capitalized, and use a dictionary-based approach to uncapitalize unambiguous tokens (this may be viewed as a sequence labelling problem and could involve machine learning, but data can easily be generated for it); see the sketch after this list.
  • Learn a model with capitalization features: you may add a capitalization feature to the machine learning setup and rebuild both the POS and NER models, but this would require corpora to retrain the models on.
  • Postprocess the data: given that the tagging is error-prone, you may apply some postprocessing that takes the previously discovered entities into account via substring matching. But this process is hazardous: if you find "America" and "Bank Of America", the correction will probably not be able to tell that "Bank Of America" is itself an entity.
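Here is a minimal sketch of the dictionary-based part of the first option (the all-capitalized-sentence detection is omitted, and the NLTK word list is just one possible vocabulary):

from nltk.corpus import words  # requires: nltk.download('words')

VOCAB = set(w.lower() for w in words.words())

def uncapitalize_heading(tokens):
    # Lowercase a token of an all-capitalized sentence when its
    # lowercased form is a known vocabulary word; unknown tokens
    # (e.g. "companyA") keep their original capitalization.
    return [tok.lower() if tok.istitle() and tok.lower() in VOCAB else tok
            for tok in tokens]

print(uncapitalize_heading("Trend Analysis For companyA".split()))
# ['trend', 'analysis', 'for', 'companyA']

As discussed in the comments below, this is exactly where "Bank Of America" would wrongly become "bank of America", so it only helps when the entity also appears in normally cased text.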

I would personally consider the first option, as you can easily create an artificial all-capitalized corpus (from correctly capitalized texts) and train a sequence labelling model (e.g. a CRF) to detect where capitalization should be removed.
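As a rough illustration of the data generation step (the label scheme is made up, and a real CRF would also need context features):

def make_training_example(tokens):
    # From a correctly capitalized sentence, build a training pair:
    # the artificially all-capitalized input and, per token, whether
    # the labeller should lowercase it to recover the original.
    capitalized = [tok.title() for tok in tokens]
    labels = ["LOWER" if tok.islower() else "KEEP" for tok in tokens]
    return capitalized, labels

print(make_training_example(["Bank", "Of", "America", "reported", "strong", "earnings"]))
# (['Bank', 'Of', 'America', 'Reported', 'Strong', 'Earnings'],
#  ['KEEP', 'KEEP', 'KEEP', 'LOWER', 'LOWER', 'LOWER'])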

Whatever approach you use, you will indeed never end up with performance as good as for correctly capitalized text. Your input can be considered partly noisy, since a clue is missing for both POS tagging and NER.

eldams
  • Thanks for the nice suggestions. By the way, `America` would be labeled as a `LOCATION`, not `ORGANIZATION` ;) I did try to play with many scenarios like these, and hence I think postprocessing might not be a bad idea. – AbtPst Jan 20 '16 at 12:38
  • You bring up an interesting point. If an entire sentence is all caps, then it's probably a heading, and maybe I can do something about it as soon as I find it. However, if I try to uncapitalize the unambiguous text using a dictionary, then `Bank Of America` would probably become `bank of America`. This might be incorrect for the sentence/heading itself, but I can still live with it provided that `Bank Of America` appears at least once elsewhere in the text. – AbtPst Jan 20 '16 at 12:40
  • Also, how can I generate data for this problem? Do you mean general data, or just by looking at the document at hand? – AbtPst Jan 20 '16 at 12:41
  • Yes, America would most probably be a location - depending on context though :) As for generating data, you may use general data: take any (correctly capitalized) text, then upcase each token; using both sequences (original vs all capitalized) you may train a sequence labeller that learns which tokens are to be uncapitalized. Not sure how efficient the algorithm would be, but it is certainly feasible and probably quite interesting! – eldams Jan 22 '16 at 22:19