I am trying to create dummy data for an NER task by replacing person_name entities with dummy names. However, I get weird results when the same entity occurs multiple times in a text, similar to the issues discussed here:
Strange result when removing item from a list while iterating over it
Modifying list while iterating
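For context, the pitfall those two questions describe can be reproduced in a few lines. This is only a minimal sketch with made-up values (`items` and `cleaned` are illustrative names, not part of my data):

```python
items = [1, 2, 2, 3]
for x in items:
    if x == 2:
        # Removing shifts the later elements left while the loop index moves on,
        # so the element right after the removed one is skipped.
        items.remove(x)
print(items)  # [1, 2, 3]: one 2 survives

# Building a new list avoids mutating the list being iterated over.
cleaned = [x for x in [1, 2, 2, 3] if x != 2]
print(cleaned)  # [1, 3]
```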
Input example spans:
{
    'text': "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
    'spans': [
        {'start': 0, 'end': 5, 'label': 'person_name', 'ngram': 'Mohan'},
        {'start': 28, 'end': 33, 'label': 'person_name', 'ngram': 'Mohan'},
        {'start': 13, 'end': 26, 'label': 'date', 'ngram': '25th dec 1980'}
    ]
}
The entity person_name occurs twice in the sample.
sample_names=['Jon', 'Sam']
I want to replace (0, 5, 'person_name') and (28, 33, 'person_name') with each name in sample_names, producing one new example per name.
Dummy Examples Output:
[
    {
        'text': "Jon dob is 25th dec 1980. Jon loves to play cricket.",
        'spans': [
            {'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
        ]
    },
    {
        'text': "Sam dob is 25th dec 1980. Sam loves to play cricket.",
        'spans': [
            {'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
        ]
    }
]
The spans also get updated in the output.
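To make the expected offsets concrete: replacing "Mohan" (5 characters) with "Jon" (3 characters) should move every later span by -2. A small sketch of that arithmetic, where `shift_span` is a hypothetical helper I wrote only for this illustration:

```python
def shift_span(span, replaced_start, gap):
    """Shift a (start, end, label) span by gap if it begins after the replaced span."""
    start, end, label = span
    if start > replaced_start:
        return (start + gap, end + gap, label)
    return span

text = "Mohan dob is 25th dec 1980. Mohan loves to play cricket."
gap = len("Jon") - len("Mohan")  # -2

# Replace only the first occurrence (0, 5) and shift the date span accordingly.
new_text = "Jon" + text[5:]
date_span = shift_span((13, 26, "date"), replaced_start=0, gap=gap)

print(date_span)                                # (11, 24, 'date')
print(new_text[date_span[0]:date_span[1]])      # 25th dec 1980
```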
target_entity='person_name'
names=sample_names
Code:
def generate(data, target_entity, names):
    text = data['text']
    spans = data['spans']
    new_sents = []
    if spans:
        # Work with sorted (start, end, label) tuples
        spans = [(d['start'], d['end'], d['label']) for d in spans]
        spans.sort()
        labellist = [s[2] for s in spans]
        # get before_spans and after_spans around target entity
        for n in names:
            gap = 0
            for i, tup in enumerate(spans):
                lab = tup[2]
                if lab == target_entity:
                    new_spans = {"before": spans[:i], "after": spans[i+1:]}
                    print("the spans before and after :\n", new_spans)
                    start = tup[0]  # check this
                    end = tup[1]
                    ngram = text[start:end]
                    # Substitute the name and shift the spans that follow
                    new_s = text[:start] + n + text[end:]
                    gap = len(n) - len(ngram)
                    before = new_spans["before"]
                    after = [(tup[0]+gap, tup[1]+gap, tup[2]) for tup in new_spans["after"]]
                    s_sp = before + [(start, start + len(n), target_entity)] + after
                    text = new_s
                    en = {"text": new_s, "spans": [{"start": tup[0], "end": tup[1], "label": tup[2], "ngram": new_s[tup[0]:tup[1]]} for tup in s_sp]}
                    spans = s_sp
                    new_sents.append(en)
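One direction that might avoid the problem is to never modify `text` or `spans` inside the loops and instead rebuild each sample from the untouched originals while tracking a running offset. The following is only a rough sketch under some assumptions: spans do not overlap, `generate_fixed` is a hypothetical name, and the output spans come out sorted by start position rather than in the input order.

```python
def generate_fixed(data, target_entity, names):
    """Return one new sample per name, with every target_entity span replaced."""
    text = data["text"]
    # Sort once; these originals are never mutated inside the loops.
    original_spans = sorted((d["start"], d["end"], d["label"]) for d in data["spans"])
    samples = []
    for name in names:
        pieces = []     # chunks of the new text, joined at the end
        new_spans = []
        cursor = 0      # position reached in the original text
        offset = 0      # cumulative length change from earlier replacements
        for start, end, label in original_spans:
            pieces.append(text[cursor:start])
            replacement = name if label == target_entity else text[start:end]
            pieces.append(replacement)
            new_start = start + offset
            new_spans.append({
                "start": new_start,
                "end": new_start + len(replacement),
                "label": label,
                "ngram": replacement,
            })
            offset += len(replacement) - (end - start)
            cursor = end
        pieces.append(text[cursor:])
        samples.append({"text": "".join(pieces), "spans": new_spans})
    return samples
```

Because every name starts again from the original text and spans, a replacement made for 'Jon' cannot affect the offsets used for 'Sam'.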