For named entity recognition with CRF++, after tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is a POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?

What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?

Is there any example of a completed training file and template for NER?

erotavlas
  • If you found my answer helpful I would appreciate it if you would accept it by clicking the checkmark. – polm23 Dec 13 '18 at 04:46

1 Answer

You can find example training and test data in the CRF++ repo. The training data for noun phrase chunking looks like this:

Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...

The columns are arbitrary in that they can contain anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
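Since a mismatched column count is an easy mistake to make, here is a minimal sketch of a sanity check you could run before training. The function name `check_columns` and the sample rows are my own illustration, not part of CRF++; it assumes columns are separated by whitespace and treats blank lines as sentence boundaries.

```python
def check_columns(lines):
    """Return (line_number, column_count) for lines that disagree with the first data line."""
    expected = None
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            # Blank line: sentence boundary, always allowed.
            continue
        if expected is None:
            # The first non-blank line sets the expected column count.
            expected = len(fields)
        elif len(fields) != expected:
            problems.append((number, len(fields)))
    return problems


sample = [
    "Confidence NN B",
    "in IN O",
    "the DT B",
    "pound NN",  # hypothetical error: the label column is missing
    "",
    "is VBZ O",
]
print(check_columns(sample))
```

To check a real file, you could pass an open file object instead of the list, since the function only iterates over lines.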

While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:

John/B Smith/I ate/O an/O apple/O ./O

In columnar format it would look like this:

John B
Smith I
ate O
an O
apple O
. O

With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not part of an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
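If your annotations are stored as token spans rather than per-token tags, the conversion to IOB is mechanical. The following is a rough sketch, assuming spans are given as `(start, end, label)` with an exclusive end index and do not overlap; `to_iob` is a hypothetical helper name, not something CRF++ provides.

```python
def to_iob(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into IOB tags."""
    tags = ["O"] * len(tokens)  # everything is outside an entity by default
    for start, end, label in spans:
        tags[start] = "B-" + label  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens of the same entity
    return tags


tokens = ["John", "Smith", "ate", "an", "apple", "."]
tags = to_iob(tokens, [(0, 2, "PERSON")])

# Print in the columnar layout: one token per line, label in the last column.
for token, tag in zip(tokens, tags):
    print(token, tag)
```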

The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.
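To see the kind of signal the transitions carry, you can count label-to-label transitions in your own tagged data. This is only an illustration: a CRF learns weights during training rather than using raw counts, and the two tag sequences below are made up for the example.

```python
from collections import Counter


def transition_counts(tag_sequences):
    """Count how often each label is followed by each other label within a sentence."""
    counts = Counter()
    for tags in tag_sequences:
        for previous, current in zip(tags, tags[1:]):
            counts[(previous, current)] += 1
    return counts


# Hypothetical tagged sentences, one list of labels per sentence.
sequences = [
    ["B-COMPANY", "I-COMPANY", "O", "O"],
    ["O", "B-COMPANY", "I-COMPANY", "O"],
]
counts = transition_counts(sequences)
for (previous, current), n in sorted(counts.items()):
    print(previous, "->", current, n)
```

In this toy data every I-COMPANY is followed by O, which is the pattern described above for a name-final token like Inc.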

Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
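As a rough aid to reading templates: a macro of the form `%x[row,col]` refers to a row relative to the current token and an absolute column. The sketch below imitates that expansion in Python so you can experiment with it; it is not CRF++'s implementation, and the `expand` function and the `<PAD>` placeholder for out-of-range rows are my own assumptions (CRF++ has its own boundary handling).

```python
import re

# Matches macros such as %x[0,0] or %x[-1,1].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position, pad="<PAD>"):
    """Replace each %x[row,col] macro with the matching cell around `position`."""
    def lookup(match):
        row = position + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))             # column is absolute
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad  # assumption: stand-in for a row outside the sentence

    return MACRO.sub(lookup, template)


rows = [line.split() for line in [
    "Confidence NN B",
    "in IN O",
    "the DT B",
    "pound NN I",
]]

# With "the" as the current token (index 2):
print(expand("%x[0,0]/%x[0,1]", rows, position=2))  # the/DT
print(expand("%x[-1,0]", rows, position=2))         # in
```

Anything outside a macro, such as the `/` or a feature ID prefix, is passed through unchanged.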


To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:

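import spacy

# spaCy 3+ no longer accepts the 'en' shortcut; install the model first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
names = {'John', 'Jane'}  # etc.; extend with your own list of known names
doc = nlp("John ate an apple.")
for word in doc:
    person = 'O'  # default: not a person
    if word.text in names:
        person = 'B-PERSON'
    # one token per line: word, POS tag, label
    print(word.text, word.pos_, person)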
polm23
  • Ok so the POS column must be provided by myself. Any recommendation on how to obtain this? How was it obtained for the various examples? Was it generated by a POS tagger, or manually entered by inspecting each token? I have a lot of data in my training sets. – erotavlas Jun 13 '18 at 17:41
  • I figured example code was best so please see my updated answer. – polm23 Jun 14 '18 at 05:51
  • What about titles, such as Dr. John Smith or Jim Doe M.D.? And what about a header like Name: Joe Smith? Are those included as the beginning or end of the name? – erotavlas Jun 18 '18 at 19:24
  • I considered them to not be part of the name, but you could make them part of the name or give them another tag, it's up to you. – polm23 Jun 19 '18 at 00:10
  • What happens if there are no line breaks between sentences, and I just provide one continuous series of tokens? – erotavlas Jun 19 '18 at 21:59
  • It would prevent you from using start or end of sentence features, which would probably hurt your model, and you might run out of memory. But you can try it. – polm23 Jun 20 '18 at 01:31
  • Sorry, one more question. The documentation doesn't explain very well what the features separated by / mean; for example, %x[0,0]/%x[0,1] expands to the/DT in their example. What does this do? – erotavlas Jun 20 '18 at 14:52
  • The / isn't special, it's just for readability. A feature like `%x[0,0]/%x[0,1]` is just a feature that is based on the token and its POS tag (the second column) together rather than just one of them. You could also have a feature for just POS, or just the literal token, etc. I would recommend reading documentation for other CRF toolkits until you understand everything. – polm23 Jun 21 '18 at 00:38