I'm struggling to get Stanford's SequenceMatchRules to recognize two dates in the following input:
Anaximander (c. 610 – c. 546 BC) was a pre-Socratic Greek philosopher who lived in Miletus, a city of Ionia (in modern-day Turkey).
(taken from the Pantheon dataset, see http://pantheon.media.mit.edu)
'546 BC' works just fine, but I also want to recognize '610' as '610 BC' (preferably NOT as a duration).
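To make the target concrete, here is a rough plain-Java sketch of the behaviour I'm after. It does not use SUTime or TokensRegex at all; the class name `EraPropagation` and the method `extractYears` are made up for illustration, and the regex only covers the simple "year – year ERA" shape. The idea is that a year without an era inherits the era of the year that closes the range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of CoreNLP: illustrates the desired result only.
final class EraPropagation {

    // Era markers with optional dots: BC, B.C., BCE, AD, A.D., CE, ...
    private static final String ERA = "(b\\.?c\\.?(?:e\\.?)?|a\\.?d\\.?|c\\.?e\\.?)";

    // [c.] YEAR [ERA] (en dash or hyphen) [c.] YEAR ERA, with optional spaces.
    private static final Pattern RANGE = Pattern.compile(
            "(?:c\\.\\s*)?(\\d{1,4})\\s*" + ERA + "?\\s*[\\u2013-]\\s*"
                    + "(?:c\\.\\s*)?(\\d{1,4})\\s*" + ERA,
            Pattern.CASE_INSENSITIVE);

    // Strip dots and upper-case the era, e.g. "b.c." becomes "BC".
    private static String normalizeEra(String era) {
        return era.replace(".", "").toUpperCase(Locale.ROOT);
    }

    // Returns both years of each range, each with an explicit era.
    static List<String> extractYears(String text) {
        List<String> years = new ArrayList<>();
        Matcher m = RANGE.matcher(text);
        while (m.find()) {
            String endEra = normalizeEra(m.group(4));
            // The first year inherits the closing era when it has none of its own.
            String startEra = m.group(2) != null ? normalizeEra(m.group(2)) : endEra;
            years.add(m.group(1) + " " + startEra);
            years.add(m.group(3) + " " + endEra);
        }
        return years;
    }
}
```

For the sentence above, `EraPropagation.extractYears(...)` yields `[610 BC, 546 BC]`, which is the pair of dates I would like SUTime to produce.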
What I've done so far, just to get things going:
Modified english.sutime.txt: changed
$POSSIBLE_YEAR = ( $YEAR /a\.?d\.?|b\.?c\.?/? | $INT /a\.?d\.?|b\.?c\.?/ | $INT1000TO3000 );
to
$POSSIBLE_YEAR = ( $YEAR /a\.?d\.?|b\.?c\.?/? | $INT /a\.?d\.?|b\.?c\.?/ | /c\.\ / $INT | $INT1000TO3000 );
And in the extraction rule for the pattern ( $POSSIBLE_YEAR)..., I changed
Tag($0, "YEAR_ERA",
  :case {
    $0 =~ ( $INT /a\.?d\.?/ ) => ERA_AD,
    $0 =~ ( $INT /b\.?c\.?/ ) => ERA_BC,
    :else => ERA_UNKNOWN
  }
)
to
Tag($0, "YEAR_ERA",
  :case {
    $0 =~ ( $INT /a\.?d\.?/ ) => ERA_AD,
    $0 =~ ( /c\.\ / $INT ) => ERA_BC,
    $0 =~ ( $INT /b\.?c\.?/ ) => ERA_BC,
    :else => ERA_UNKNOWN
  }
)
First, it's ugly; second, it didn't work at all.
Where should I begin to get this right?
I'm using stanford-corenlp-full-2018-10-05.
I should mention that Pantheon is not perfectly normalized, so later I will have to deal with additional variations such as CE/BCE and missing spaces around dates. An extensible approach would therefore be great.