
I have a corpus of 170 Dutch literary novels to which I will apply Named Entity Recognition (NER). To evaluate existing NER taggers for Dutch, I want to manually annotate named entities in a random sample of this corpus; I use brat for this purpose. The manually annotated random sample will serve as the 'gold standard' in my evaluation of the NER taggers. I wrote a Python script that outputs a random sample of my corpus at the sentence level.

My question is: what is the ideal size of the random sample in terms of the number of sentences per novel? For now I used a random 100 sentences per novel, but this leads to a rather large sample of roughly 21,626 lines, which is a lot to annotate manually and which makes for a slow working environment in brat.
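For reference, here is a minimal sketch of the kind of per-novel sampling I mean, assuming one UTF-8 plain-text file per novel with one sentence per line; the `corpus/` layout and the `sample_sentences` helper are simplified placeholders, not my exact script:

```python
import random
from pathlib import Path

def sample_sentences(novel_path, n=100, seed=42):
    """Return up to n randomly chosen sentences from a one-sentence-per-line file."""
    with open(novel_path, encoding="utf-8") as handle:
        sentences = [line.strip() for line in handle if line.strip()]
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))

# Hypothetical layout: corpus/<novel>.txt, one sentence per line.
if __name__ == "__main__":
    for novel in sorted(Path("corpus").glob("*.txt")):
        for sentence in sample_sentences(novel, n=100):
            print(sentence)
```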

  • Welcome to NLP, where human annotation is so expensive that it isn't even funny. :) – erip Nov 22 '16 at 13:59
  • :-) But aren't there any standards for sample sizes for the kind of experiments I'm doing? – roelmetgevoel Nov 22 '16 at 14:02
  • It depends a lot on how many entity-types you have, the size of the vocabulary (i.e., smaller domains tend to do better), etc. Lots of variables. – erip Nov 22 '16 at 14:06
  • As to using brat: In my personal opinion, that is overkill; brat is designed for building complex syntactic annotations ("events"), while you only need to tag the named entities. Better options for that are: GATE, [WebAnnotator](https://addons.mozilla.org/en-US/firefox/addon/webannotator/) (a Firefox plugin), [MyMiner](http://myminer.armi.monash.edu.au/entity_modified.php) (disclaimer: was involved in that...), [AnnotateIt](http://annotateit.org/), or [Marky](http://sing.ei.uvigo.es/marky/) (disclaimer: I know those guys...). – fnl Nov 23 '16 at 07:57
  • Hmmm. I just realized that I believe this question actually should be asked on CrossValidated... – fnl Nov 23 '16 at 16:30

2 Answers

2

NB, before the actual answer: The biggest issue I see is that you can only evaluate the tools with respect to those 170 books. So at best, the evaluation will tell you how well the NER tools work on those books or on similar texts. But I guess that is obvious...

As to sample sizes, I would guesstimate that you need no more than a dozen random sentences per book. Here's a simple way to check whether your sample size is already big enough: randomly choose only half of the sentences you annotated (stratified per book!) and evaluate all the tools on that subset. Do that a few times and see whether the results for the same tool vary widely between runs (say, more than +/- 0.1 if you use F-score, for example; it mostly depends on how "precise" you have to be to detect significant differences between the tools). If the variance is very large, continue to annotate more random sentences. If the numbers start to stabilize, you're good and can stop annotating.
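A rough sketch of that stability check, assuming your gold annotations are stored as (book_id, sentence) pairs and that `evaluate_f1` stands in for whatever scoring function you use; both names are placeholders, not part of any particular tool:

```python
import random
from collections import defaultdict

def half_split_scores(annotated, evaluate_f1, tool, runs=10, seed=0):
    """Evaluate `tool` several times on a random half of the gold data,
    stratified per book, and return the scores so their spread can be inspected."""
    rng = random.Random(seed)
    by_book = defaultdict(list)
    for book_id, sentence in annotated:
        by_book[book_id].append(sentence)

    scores = []
    for _ in range(runs):
        subset = []
        for book_id, sentences in by_book.items():
            half = rng.sample(sentences, len(sentences) // 2)  # half of this book's sentences
            subset.extend((book_id, s) for s in half)
        scores.append(evaluate_f1(tool, subset))
    return scores

# If max(scores) - min(scores) stays well below ~0.1 F-score for every tool,
# the sample is probably large enough; otherwise keep annotating.
```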

fnl
  • 4,861
  • 4
  • 27
  • 32
  • I won't be using the annotated sample to train a NER tool. I will use the random sample to evaluate existing NER tools (which are themselves trained on varied training sets) for accuracy when applied to contemporary Dutch literary fiction. For those purposes, would it also suffice to use a random sample of (for instance) 10 sentences per novel? – roelmetgevoel Nov 23 '16 at 09:09
  • Wait a minute; you don't want to use this data to develop a NER tool, but rather to evaluate existing ones? So you only need to find out which of the NER tools is the best one for analyzing those 170 books (and maybe some similar books)? But sure, 1700 samples should be more than enough to *evaluate* existing systems. – fnl Nov 23 '16 at 16:08
  • Updated my answer after realizing that the OP didn't want to build a NER tool, only evaluate existing ones (missed that point initially...) – fnl Nov 23 '16 at 16:23
1

Indeed, the "ideal" size would be... the whole corpus :)

Results will correlate with the degree of detail of the typology: just PERS, LOC and ORG would require a minimal sample size, but what about a fine-grained typology, or even full disambiguation (entity linking)? I suspect good performance wouldn't need much data (just enough to validate it), while low performance would require more data to get a more detailed view of the errors.

As an indication, cross-validation is considered a standard methodology; it typically uses 10% of the corpus for evaluation (but the evaluation is repeated 10 times, once per fold).
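As an illustration of that scheme, here is a minimal sketch of a 10-fold split; the `ten_fold_indices` helper is just a placeholder, not part of any of the tools discussed:

```python
import random

def ten_fold_indices(n_items, seed=0):
    """Shuffle item indices and split them into 10 roughly equal folds;
    each fold serves once as the ~10% evaluation set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::10] for i in range(10)]

# e.g. one index per annotated sentence; score each NER tool on every fold in turn.
for fold in ten_fold_indices(1700):
    pass  # evaluate the tools on the sentences at these indices
```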

Besides, if you work with older novels, you will probably face a lexical coverage problem: many old proper names will not be included in the lexical resources of the available software, which is a severe drawback for NER accuracy. It could therefore be a good idea to split the corpus by decade or century and run separate evaluations, so as to measure the impact of this problem on performance.
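If the corpus did span several periods, the split could be as simple as grouping novels by publication decade before sampling; the metadata tuples below are purely illustrative:

```python
from collections import defaultdict

def group_by_decade(novels):
    """Map publication decade -> list of titles, from (title, year) pairs."""
    by_decade = defaultdict(list)
    for title, year in novels:
        by_decade[(year // 10) * 10].append(title)
    return dict(by_decade)

# Evaluate each NER tool separately on the sample drawn from each decade.
print(group_by_decade([("Novel A", 1987), ("Novel B", 2013), ("Novel C", 2015)]))
```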

eldams
  • The whole corpus obviously won't be possible, too time-consuming :-) I will only use PERS, LOC and ORG and won't be doing any disambiguation; it's just an evaluation of straightforward named entity extraction for Dutch literary fiction. I will be working with novels published in 2013, so no lexical coverage problem in that respect. Because brat performs really slowly on large documents, I am now experimenting with a random sample of 10 sentences per novel (2280 lines in total) – do you think that would suffice? – roelmetgevoel Nov 23 '16 at 11:30
  • Yes, I guess it could be sufficient, as long as you get acceptable performance and, as mentioned by @fnl, the variance is not too large. – eldams Dec 04 '16 at 15:25