Extract word terms from sentence between the symbol '<>' and the nested case '<<>>'

Question

I have a Named Entity Recognition news dataset (text).

Here is a sample:

<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups.

I'm trying to extract the entities enclosed in <>. The problem is the nested labels; the output is:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

It is wrong because "EVENT S Asian" and "E Cup" should be one string, not two.

I've tried a regex, but it doesn't work well.

import re
s = """<LOC Qatar> and <LOC Japan>, 
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
re.findall(r'<.*?>', s)

Actual results:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

Expected results:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian> <E Cup>>',
 '<DATE February>']

That is true, the concept is almost the same; I was trying to implement it in Python. — amn89, Jun 24 '19 at 13:45

score 2 · Accepted Answer · answered Jun 24 '19 at 13:36

You want to apply a recursive pattern, as mentioned in the comments. The third-party regex module supports this; the standard re module does not.

Here is the code:

# Import module
import regex as reg

# Your string
s = """<LOC Qatar> and <LOC Japan>, 
who met in the < EVENT < S Asian > < E Cup >> final in < DATE February > , are in third place in their groups. """

# Match pattern
my_list = reg.findall("<((?:[^<>]|(?R))*)>", s)
print(my_list)
# ['LOC Qatar', 'LOC Japan', ' EVENT < S Asian > < E Cup >', ' DATE February ']

If you really want the words surrounded by <>, you can add them back:

my_list = ['<' + elt + '>' for elt in my_list]
print(my_list)
# ['<LOC Qatar>', '<LOC Japan>', '< EVENT < S Asian > < E Cup >>', '< DATE February >']
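
If installing the third-party regex module is not an option, a plain depth counter may be enough. The following is only a minimal sketch and not part of the original answer: extract_top_level is a hypothetical helper name, and it assumes the brackets are balanced (a stray '>' or an unclosed '<' is silently skipped). It uses the unspaced sample string from the question.

```python
def extract_top_level(text, open_ch="<", close_ch=">"):
    """Return every outermost open_ch...close_ch span, delimiters included.

    Hypothetical helper for illustration; assumes balanced brackets.
    """
    spans = []
    depth = 0      # current nesting level
    start = None   # index where the current outermost bracket opened
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at the top level: the outermost span is complete
                spans.append(text[start:i + 1])
    return spans


s = ("<LOC Qatar> and <LOC Japan>, who met in the "
     "<EVENT <S Asian> <E Cup>> final in <DATE February>, "
     "are in third place in their groups.")
print(extract_top_level(s))
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
```

The counter does in a single pass what the recursive pattern does: nested brackets only change the depth, and a span is emitted when the depth returns to zero.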

Thanks a lot, you are right. I was trying to implement ```(\<([^<>]|\)*\>)``` using ```re``` — amn89, Jun 24 '19 at 13:44
