
Given a block of lower-case text, how do you identify acronyms using a tool like spaCy, or something similar? I'm trying to intelligently capitalize words that are proper nouns, and I'm having trouble identifying which of them are acronyms.

spaCy's POS tagger works reasonably well for identifying proper nouns, including most common acronyms, via its standard document object. However, I don't see any easy way to differentiate between a short name and an acronym in the tokens it returns.

For example:

import spacy
nlp = spacy.load('en_core_web_lg')
text = 'joe bought stock in ibm'
doc = nlp(text)
for i, token in enumerate(doc):
    print(i, token.text, token.pos_)

prints out:

0 joe PROPN
1 bought VERB
2 stock NOUN
3 in ADP
4 ibm PROPN

So it correctly identified the two proper nouns. However, nothing in tokens 0 or 4 seems to mark one as a regular name and the other as an acronym.
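To make the goal concrete, here is a minimal sketch of the capitalization step I'm after, written without spaCy so it runs on its own. It takes the (text, POS) pairs from the output above and assumes acronyms can be recognized with a hand-maintained lookup set. `KNOWN_ACRONYMS` and `capitalize_token` are hypothetical names for illustration, not part of spaCy, and the lookup is exactly the part I'd like to replace with something more reliable.

```python
# Hypothetical lookup table (an assumption for this sketch);
# a real list would need to be far larger.
KNOWN_ACRONYMS = {"ibm", "nasa", "fbi"}


def capitalize_token(text, pos):
    """Capitalize one lower-case token given its coarse POS tag."""
    if pos != "PROPN":
        # Leave anything that is not a proper noun untouched.
        return text
    if text in KNOWN_ACRONYMS:
        # Acronyms are upper-cased in full.
        return text.upper()
    # Ordinary names only get an initial capital.
    return text.capitalize()


# (text, pos) pairs copied from the tagger output above.
tagged = [
    ("joe", "PROPN"),
    ("bought", "VERB"),
    ("stock", "NOUN"),
    ("in", "ADP"),
    ("ibm", "PROPN"),
]

result = " ".join(capitalize_token(text, pos) for text, pos in tagged)
print(result)  # Joe bought stock in IBM
```

The lookup only works for acronyms I already know about, which is why I'm hoping for a signal from the model itself.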

I can't find anything in the docs that clarifies this. Is there a way to detect an acronym in spaCy? If not, is there another reliable way?

Cerin