Detecting lower-case acronyms in text

Question

Given a block of lower-case text, how do you identify acronyms using a tool like spaCy, or something similar? I'm trying to intelligently capitalize words if they're proper nouns, and I'm having trouble identifying acronyms.

spaCy's POS tagger works reasonably well for identifying proper nouns, including most common acronyms, via its standard document object, but I don't see any easy way to distinguish a short name from an acronym in the tokens it returns.

For example:

    import spacy
    nlp = spacy.load('en_core_web_lg')
    text = 'joe bought stock in ibm'
    doc = nlp(text)
    for i, token in enumerate(doc):
        print(i, token.text, token.pos_)

prints out:

    0 joe PROPN
    1 bought VERB
    2 stock NOUN
    3 in ADP
    4 ibm PROPN

So it correctly identified the two proper nouns. However, there doesn't seem to be anything in tokens 0 or 4 that identifies one as a regular name and the other as an acronym.

I can't find anything in the docs to clarify. Is there any way in spaCy to detect an acronym? If not, are there any other reliable ways?
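
To make the goal concrete, here is a minimal sketch of the behaviour I'm after. It does not detect acronyms on its own: it assumes a hand-maintained lookup set (`KNOWN_ACRONYMS`, a hypothetical name, not something spaCy provides) and takes `(text, pos)` pairs like the ones printed above. The helper `smart_capitalize` is likewise just an illustration.

```python
# Hypothetical, hand-maintained set of acronyms (an assumption for this sketch;
# spaCy does not ship anything like it).
KNOWN_ACRONYMS = {"ibm", "nasa", "fbi"}


def smart_capitalize(tagged_tokens, acronyms=KNOWN_ACRONYMS):
    """Capitalize proper nouns, fully upper-casing those listed as acronyms.

    tagged_tokens: iterable of (text, pos) pairs, e.g. built from
    (token.text, token.pos_) for each token in a spaCy doc.
    """
    words = []
    for text, pos in tagged_tokens:
        if pos == "PROPN":
            if text.lower() in acronyms:
                words.append(text.upper())       # known acronym: IBM
            else:
                words.append(text.capitalize())  # ordinary name: Joe
        else:
            words.append(text)                   # leave other words as they are
    # Joining on single spaces is a simplification; it ignores original spacing.
    return " ".join(words)


# Same tags as in the output above.
tagged = [("joe", "PROPN"), ("bought", "VERB"), ("stock", "NOUN"),
          ("in", "ADP"), ("ibm", "PROPN")]
print(smart_capitalize(tagged))  # Joe bought stock in IBM
```

The obvious weakness is the lookup set itself: any acronym not in it gets treated as a regular name, which is why I'm hoping for something the model can tell me directly.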

It [does not look an easy thing](https://stackoverflow.com/questions/43510778/python-how-to-intuit-word-from-abbreviated-text-using-nlp). — Wiktor Stribiżew, Oct 08 '19 at 20:32
@SamH. Unfortunately, Spacy's implementation seems to be case dependent. In my example, `doc[0].ent_type_` is ORG, not PERSON. However, if I capitalize Joe, then it correctly tags it as PERSON. — Cerin, Oct 09 '19 at 01:19
You could also use a better ner model like https://github.com/zalandoresearch/flair — Sam H., Oct 09 '19 at 02:18
