Named entity recognition with Java

Question

I would like to use named entity recognition (NER) to find adequate tags for texts in a database. Instead of using tools like NLTK or LingPipe, I want to build my own tool.

So my questions are:

Which algorithm should I use?
How hard is it to build this tool?

Since there are so many ways to go about it, we could advise you better if you shared your goals and why you're trying to do it yourself. Also, are you willing to use any libraries at all, such as machine learning libraries? — John Lehmann, Apr 08 '11 at 03:19

score 5 · Answer 1 · answered Apr 06 '11 at 21:16

I did this some time ago when I studied Markov chains.

Anyway, the answers are:

Which algorithm should I use?

Stanford NLP, for example, uses a Conditional Random Field (CRF). If you are not trying to do this effectively, you are like the dude from Jackass 3D who was pissing in the wind. There is no simple way to parse human language, as its construction is complex and it has tons of exceptions.

How hard is it to build this tool?

Well, if you know what you are doing, then it's not that hard at all. The process of entering the rules and logic can be annoying and time-consuming, and fixing bugs can be nontrivial. But in 20 years, you can make something almost useful (for yourself).

Skarab · Answer 2 · 2011-04-06T21:21:47.767

There is a vast number of Information Extraction algorithms, to name a few: regular expressions, statistical methods, machine-learning-based methods, dictionaries, etc. You can find a complete overview of these methods in this survey.
Yes, it is hard to build a tool that finds tags with high precision, because it requires a lot of testing and tuning.
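To make the two simplest approaches from that list concrete, here is a minimal sketch that combines a dictionary lookup with one regular expression. The class name, the dictionary entries, and the entity type labels are all made up for illustration; a real dictionary would be loaded from a file or a database and would be much larger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy entity finder: dictionary lookup plus one regular expression. */
final class SimpleEntityFinder {

    // Hypothetical dictionary (gazetteer); entries and labels are illustrative only.
    private static final Map<String, String> GAZETTEER = Map.of(
            "Stanford", "ORGANIZATION",
            "Berlin", "LOCATION",
            "OpenCalais", "PRODUCT");

    // Matches four-digit years from 1900 to 2099.
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // Splits on anything that is not a letter or a digit.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    /** Returns "text/TYPE" strings: dictionary hits in text order, then years. */
    static List<String> find(String text) {
        List<String> entities = new ArrayList<>();
        for (String token : NON_WORD.split(text)) {
            String type = GAZETTEER.get(token);
            if (type != null) {
                entities.add(token + "/" + type);
            }
        }
        Matcher m = YEAR.matcher(text);
        while (m.find()) {
            entities.add(m.group() + "/YEAR");
        }
        return entities;
    }

    public static void main(String[] args) {
        // Prints [Stanford/ORGANIZATION, Berlin/LOCATION, 2006/YEAR]
        System.out.println(find("She joined Stanford in 2006 and later moved to Berlin."));
    }
}
```

This only handles single-word dictionary entries and knows nothing about context, which is exactly where the testing and tuning effort goes.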

The easiest algorithm to implement for finding tags consists of two steps:

Extract candidates for tags
Find most significant tags - most disti.

In the first step you can take one of two approaches:

Use entity names as tag candidates (here you need an Information Extraction framework)
Use nouns or noun groups as tag candidates (here you need a part-of-speech tagger)
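If you want to see the shape of the first step without pulling in a framework, a very crude stand-in is to treat runs of capitalized words as candidates. Note that this is a swapped-in heuristic, not a part-of-speech tagger or a real entity recognizer, and the class name and stop list below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude candidate extraction: runs of capitalized words stand in for entity names. */
final class CandidateExtractor {

    // A run of one or more capitalized words, e.g. "London" or "New York".
    private static final Pattern CAPITALIZED_RUN =
            Pattern.compile("\\b\\p{Lu}\\p{L}+(?:\\s+\\p{Lu}\\p{L}+)*");

    // Hypothetical stop list for capitalized function words at sentence starts.
    private static final Set<String> STOP_WORDS =
            Set.of("The", "An", "In", "On", "It", "This");

    /** Returns capitalized runs in text order, skipping runs that are only a stop word. */
    static List<String> candidates(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CAPITALIZED_RUN.matcher(text);
        while (m.find()) {
            String run = m.group();
            if (!STOP_WORDS.contains(run)) {
                result.add(run);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [Ada Lovelace, Charles Babbage, New York]
        System.out.println(candidates(
                "The talk about Ada Lovelace and Charles Babbage was held in New York."));
    }
}
```

It breaks as soon as a sentence starts with an ordinary word followed by a name, or the text is in a language that capitalizes all nouns, which is why the proper tools are recommended above.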

In the second step, you should use tf-idf to weight the tags across the document corpus and discard all tags that have a tf-idf weight below a given threshold.
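The second step can be sketched roughly like this, assuming the candidates for each document have already been extracted as lists of strings. The class and method names are mine, and the weighting used here (relative term frequency times the natural log of N/df) is only one of several common tf-idf variants.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Keeps the tag candidates of one document whose tf-idf weight reaches a threshold. */
final class TfIdfTagger {

    static List<String> selectTags(List<String> doc, List<List<String>> corpus, double threshold) {
        // Document frequency: in how many documents each candidate occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : corpus) {
            for (String term : new HashSet<>(d)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Raw term counts within the document, in order of first occurrence.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }

        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / doc.size();
            // Terms unseen in the corpus are treated as occurring in one document.
            double idf = Math.log((double) corpus.size() / df.getOrDefault(e.getKey(), 1));
            if (tf * idf >= threshold) {
                tags.add(e.getKey());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("java", "ner", "java", "tool"),
                List.of("java", "database"),
                List.of("java", "tagging"));
        // "java" occurs in every document, so its weight is 0 and it is dropped.
        // Prints [ner, tool]
        System.out.println(selectTags(corpus.get(0), corpus, 0.1));
    }
}
```

The threshold of 0.1 is arbitrary; in practice you would tune it on your own corpus.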

If you need a more powerful algorithm, look for topic detection frameworks or research papers on this topic. Also check LSA; quoting Wikipedia:

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

Please also check this question - http://stackoverflow.com/questions/5544475/does-an-algorithm-exist-to-help-detect-the-primary-topic-of-an-english-sentence. — Skarab, Apr 06 '11 at 21:20
Please also check this post - http://nlpers.blogspot.com/2011/04/seeding-transduction-out-of-sample.html , it describes a researcher's hands-on experience in creating taggers. — Skarab, Apr 07 '11 at 08:50

score 2 · Answer 3 · answered Apr 06 '11 at 20:38

NLTK is an open-source project. You might want to explore it a little bit - see how it is done, maybe get involved in the community, rather than trying to completely solve the problem by yourself from scratch...

Avi


score 0 · Answer 4 · edited Jun 20 '20 at 09:12

Look for a copy of this paper:

Name Tagging with Word Clusters and Discriminative Training

Scott Miller, Jethran Guinness, Alex Zamanian

answered Apr 07 '11 at 01:08

bmargulies


score 0 · Answer 5 · answered Jun 03 '11 at 15:28

This may not be a satisfactory answer to your question, but you might want to evaluate existing service providers for the task and either bundle their product or integrate one via web services.

My experience is that for certain well-defined and very domain-specific tasks (for example, recognizing names of medications within Wikipedia web pages), you can build NER solutions by hand. LingPipe, OpenNLP, etc. are good tools for this.

But for generic tasks (for example, finding person names in any web page on the internet), you need a lot of experience, tools, and manpower to get satisfactory results. It might therefore be more effective to use an external provider. OpenCalais is a free service, for example; many commercial ones exist.
