For integer/date values annotated using Prodigy, does the spaCy model learn the range of values as well?

Question

I have a Prodigy session set up to annotate certain numeric values in a document for age (ranging from 0 to 100). I am only annotating the number. My question is: suppose a corrupt value creeps in (an age of 1000 or 22.7). Will the model understand that it should not be picked up, even though it appears close to the age text in the document?

In other words, can it learn the range of integer values, and if so, will that work for dates as well? For instance, take a date of birth (DOB) in the format dd/mm/yyyy, where all the annotated values are before 01/01/2000. If the document contains the date 31/12/2020, will that get picked up too, even though it is nowhere near the range of the annotated dates?

Thank you

score 0 · Accepted Answer · answered Mar 25 '21 at 03:44

Good question! spaCy does not internally represent numeric tokens as numbers, so it has no explicit concept of their values. In that sense, it can't distinguish between valid and invalid values for age.

However, spaCy does use "shape" features when representing tokens, and these will help it recognize valid ages. There are different kinds of shape features, but the one spaCy uses represents a word by converting each character to a symbol for its character type. It works like this:

spaCy → xxxXx
fish → xxxx
Fish → Xxxx
23 → dd
1000 → dddd
22.7 → dd.d

Because of this, you could expect spaCy to learn that two-digit numbers are likely to be ages, while numbers with decimals or four digits are not. On the other hand, this doesn't help it differentiate between 100 and 999, since both have the shape ddd.
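To make the shape idea concrete, here is a small pure-Python sketch that approximates the mapping described above. It is not spaCy's own code: the function name `word_shape` and the detail of capping runs of the same character type at four symbols are assumptions made for illustration. In spaCy itself you would read the value from `token.shape_`.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style shape string for a single token.

    Assumption: letters map to x/X, digits map to d, other characters
    are kept as-is, and runs of the same symbol are capped at four.
    """
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:
            shape.append(symbol)
    return "".join(shape)


for token in ["23", "1000", "22.7", "100", "999", "15/06/1985", "31/12/2020"]:
    print(token, "->", word_shape(token))
```

Running it shows that 100 and 999 share one shape, and that a plausible birthdate and an implausible one also share one shape, so the shape feature alone cannot separate them.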

For dates, this will not help with determining valid or invalid birthdates: every date written as dd/mm/yyyy has the same shape, whatever the year. Shape is just one of spaCy's features, but other features like prefix and suffix aren't really going to help with this either.

Since it's easy to verify numeric values in code, what I would suggest is matching broadly in spaCy and then using your own function to check whether dates or ages are valid by parsing them.
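As a rough sketch of that second step, the functions below validate candidate strings after they have been matched. The names `parse_age` and `parse_dob` are hypothetical, and the limits (ages 0 to 100, birthdates before 01/01/2000 in dd/mm/yyyy format) are taken from the question, so adjust them to your data. The input would be the text of whatever spans your model predicts, for example `ent.text`.

```python
from datetime import date, datetime
from typing import Optional

MAX_AGE = 100                  # assumption: age range from the question
DOB_CUTOFF = date(2000, 1, 1)  # assumption: DOBs must fall before this date


def parse_age(text: str) -> Optional[int]:
    """Return the age if text is a whole number in 0..MAX_AGE, else None."""
    text = text.strip()
    # Rejects decimals such as "22.7" and anything that is not plain digits.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if 0 <= value <= MAX_AGE else None


def parse_dob(text: str) -> Optional[date]:
    """Return the date if text is a real dd/mm/yyyy date before DOB_CUTOFF."""
    try:
        parsed = datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        # Wrong format or an impossible date such as 31/02/1990.
        return None
    return parsed if parsed < DOB_CUTOFF else None


for candidate in ["23", "1000", "22.7"]:
    print(candidate, "->", parse_age(candidate))
for candidate in ["15/06/1985", "31/12/2020", "31/02/1990"]:
    print(candidate, "->", parse_dob(candidate))
```

Keeping the range check outside the model means the limits can change without re-annotating or retraining anything.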

Beyond spaCy in particular, the question of how NLP models represent numeric values is an increasingly popular research topic. If you'd like to know more, this is a recent article on the subject: Do Language Models Know How Heavy an Elephant Is?

Thank you so much for that, really helpful. However, it seems that the word embeddings for numbers are represented differently. For example, in `I got 10 apples for 50c` the two numbers would have different embeddings, right? I thought that is what is used to represent the numbers internally, and not the shape (apologies if I am wrong). Then the model would know the difference between 111 and 999? – ren1199, Mar 29 '21 at 16:28
Yes, to be clear, the shape property is only one feature used to represent tokens. So 111 and 999 are not represented identically; only their shapes are the same. – polm23, Mar 30 '21 at 03:58
