I have to create a training data set for a named-entity recognition project.

For example, I have the text

"Last year, I was in London where I saw Tom"

The training data should be

"Last year, I was in <ENAMEX TYPE="LOCATION">London</ENAMEX> where I saw  
<ENAMEX TYPE="NAME">Tom</ENAMEX>"

It is easy to do by hand, but it takes time when there is a large amount of data. I cannot use an open data set. I have a small training data set, but I need to extend it.

How can I create a larger training data set by extending a small one? Are there any ready-made packages or open projects for this? Or would you suggest a different method?

angel-a

1 Answer

First, if you aren't already, use a tool like brat to make annotating go faster.

Since it looks like you're marking tokens that are only ever used in one way, you can make a list of them and auto-annotate them. For example, London is always a place, so you can replace all instances of London with <ENAMEX TYPE="LOCATION">London</ENAMEX>. Be careful of cases where this doesn't work, like Turkey or China ("We ate turkey sandwiches off china plates.").
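As a rough sketch of the list-based approach, something like the following could work. The `GAZETTEER` dictionary and the `auto_annotate` function are hypothetical names made up for illustration, and the entries are just the ones from your example. It matches whole words only and is case-sensitive, which is an assumption on my part: it keeps lowercase "turkey" and "china" untagged, but it will still mis-tag a capitalized "Turkey" at the start of a sentence, so the output needs a manual review pass.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
# Only include names that are (almost) always used in one sense.
GAZETTEER = {"London": "LOCATION", "Tom": "NAME"}


def auto_annotate(text, gazetteer=GAZETTEER):
    """Wrap every whole-word, case-sensitive gazetteer match in an ENAMEX tag."""
    if not gazetteer:
        return text
    # Try longer names first so multi-word entries win over their substrings.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def tag(match):
        name = match.group(1)
        return '<ENAMEX TYPE="{}">{}</ENAMEX>'.format(gazetteer[name], name)

    return pattern.sub(tag, text)


print(auto_annotate("Last year, I was in London where I saw Tom"))
```

Run it on raw text only. Running it a second time on already tagged text would nest the tags.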

There's a project called Prodigy, currently in beta, that's designed for getting models off the ground. I haven't had a chance to try it yet, but it should be worth a look.

polm23