I have a Named Entity Recognition (NER) news dataset in plain text.
Here is a sample:
<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups.
I'm trying to extract the entities enclosed in <>. The problem is with the nested labels; the output I get is:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian>',
'<E Cup>',
'<DATE February>']
This is wrong because '<EVENT <S Asian>' and '<E Cup>' should be one string, '<EVENT <S Asian> <E Cup>>', not two.
I've tried a regex, but the non-greedy match stops at the first >, so the nested label gets split:
import re
s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
re.findall(r'<.*?>', s)
Actual results:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian>',
'<E Cup>',
'<DATE February>']
Expected results:
['<LOC Qatar>',
'<LOC Japan>',
'<EVENT <S Asian> <E Cup>>',
'<DATE February>']
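For reference, here is a minimal sketch of the behaviour I'm after, assuming the brackets are always balanced. It tracks nesting depth and only emits a span when the outermost < is closed. The names extract_entities and ONE_LEVEL are my own, not from any library. The regex variant is an assumption-laden shortcut: it only handles one level of nesting, which happens to be enough for this sample.

```python
import re


def extract_entities(text):
    """Return top-level <...> spans, keeping nested tags inside their parent."""
    entities = []
    depth = 0      # how many '<' are currently open
    start = None   # index of the outermost '<' of the current span
    for i, ch in enumerate(text):
        if ch == "<":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
            if depth == 0:
                # The outermost tag just closed: emit the whole span.
                entities.append(text[start:i + 1])
    return entities


# Regex alternative, limited to ONE level of nesting:
# inside the outer <...>, allow either plain characters or a complete inner <...>.
ONE_LEVEL = r"<(?:[^<>]|<[^<>]*>)*>"

s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""

print(extract_entities(s))
print(re.findall(ONE_LEVEL, s))
```

Both calls give the expected list on the sample above. The depth counter copes with arbitrary nesting, while the regex would need another alternation for every extra level.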