
I am trying to create dummy data for an NER task by replacing person_name entities with dummy names. However, I get weird results when the same entity occurs multiple times in a sample, as discussed here:

Strange result when removing item from a list while iterating over it

Modifying list while iterating

Example input with spans:

{
 'text': "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
 'spans': [{'start': 0, 'end': 5, 'label': 'person_name', 'ngram': 'Mohan'},
           {'start': 28, 'end': 33, 'label': 'person_name', 'ngram': 'Mohan'},
           {'start': 13, 'end': 26, 'label': 'date', 'ngram': '25th dec 1980'}
          ]
}

The person_name entity occurs twice in this sample.

sample_names=['Jon', 'Sam']

I want to replace both (0, 5, 'person_name') and (28, 33, 'person_name') with each name in sample_names, producing one new example per name.

Expected output:

[
 {'text': "Jon dob is 25th dec 1980. Jon loves to play cricket.",
  'spans': [{'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
           ]
 },

 {'text': "Sam dob is 25th dec 1980. Sam loves to play cricket.",
  'spans': [{'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
           ]
 }
]

The span offsets must also be updated in the output.
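To make that requirement concrete: every span should still point at its own ngram after the replacement. Below is a small sanity check I would expect the output to pass. The helper name `spans_match` is made up for illustration, and the `stale` sample shows what happens if the text changes but the old offsets are kept.

```python
def spans_match(sample):
    """Return True if every span's 'ngram' equals the text slice it points at."""
    text = sample["text"]
    return all(text[s["start"]:s["end"]] == s["ngram"] for s in sample["spans"])


original = {
    "text": "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
    "spans": [
        {"start": 0, "end": 5, "label": "person_name", "ngram": "Mohan"},
        {"start": 28, "end": 33, "label": "person_name", "ngram": "Mohan"},
        {"start": 13, "end": 26, "label": "date", "ngram": "25th dec 1980"},
    ],
}

# Names replaced in the text, but offsets left untouched (the failure case).
stale = {
    "text": "Jon dob is 25th dec 1980. Jon loves to play cricket.",
    "spans": [
        {"start": 0, "end": 5, "label": "person_name", "ngram": "Jon"},
        {"start": 28, "end": 33, "label": "person_name", "ngram": "Jon"},
        {"start": 13, "end": 26, "label": "date", "ngram": "25th dec 1980"},
    ],
}

print(spans_match(original))
print(spans_match(stale))
```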

target_entity='person_name'
names=sample_names

Code:

def generate(data, target_entity, names):
    text = data['text']
    spans = data['spans']
    new_sents = []

    if spans:
        spans = [(d['start'], d['end'], d['label']) for d in spans]
        spans.sort()
        labellist = [s[2] for s in spans]
        # get before_spans and after_spans around target entity

        for n in names:
            gap = 0
            for i, tup in enumerate(spans):
                lab = tup[2]
                if lab == target_entity:
                    new_spans = {"before": spans[:i], "after": spans[i+1:]}
                    print("the spans before and after  :\n", new_spans)
                    start = tup[0]  # check this
                    end = tup[1]
                    ngram = text[start:end]
                    new_s = text[:start] + n + text[end:]
                    gap = len(n) - len(ngram)

                    before = new_spans["before"]
                    after = [(tup[0]+gap, tup[1]+gap, tup[2]) for tup in new_spans["after"]]
                    s_sp = before + [(start, start + len(n), target_entity)] + after
                    text = new_s
                    en = {"text": new_s, "spans": [{"start": tup[0], "end": tup[1], "label": tup[2], "ngram": new_s[tup[0]:tup[1]]} for tup in s_sp]}
                    spans = s_sp

            new_sents.append(en)

    return new_sents
MAC

1 Answer

If all you seek to do is replace the placeholder with a new value, you can do something like this:

## --------------------
## Some example input from you
## --------------------
input_data = [
    (162, 171, 'pno'),
    (241, 254, 'person_name'),
    (373, 384, 'date'),
    (459, 477, 'date'),
    None,
    (772, 785, 'person_name'),
    (797, 806, 'pno')
]
## --------------------

## --------------------
## create an iterator out of our name list
## you will need to decide what happens if sample names
## gets exhausted.
## --------------------
sample_names = [
    'Jon',
    'Sam'
]
sample_names_itter = iter(sample_names)
## --------------------

for row in input_data:
    if not row:
        continue

    start = row[0]
    end = row[1]
    name = row[2] if row[2] != "person_name" else next(sample_names_itter)

    print(f"{name} dob is 25th dec 1980. {name} loves to play cricket.")
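If the goal is instead to rewrite the text and keep every span's offsets in sync, one possible approach is to rebuild the text left to right from the original while tracking a cumulative shift. Offsets go stale in the question's loop because the original start/end values are reused after an earlier replacement has already changed the text length, and because `text` is carried over from one name to the next. The following is a minimal sketch, assuming the spans do not overlap; the function name `replace_entities` and the choice to return spans sorted by start offset are my own, not something from the question.

```python
def replace_entities(data, target_entity, names):
    """Return one new sample per name, replacing every `target_entity` span.

    Assumption: spans do not overlap.
    """
    text = data["text"]
    # Process spans left to right so the running shift can be applied in order.
    spans = sorted(data["spans"], key=lambda s: s["start"])
    results = []

    for name in names:
        pieces = []     # fragments of the new text
        new_spans = []
        cursor = 0      # how much of the ORIGINAL text has been copied so far
        shift = 0       # cumulative length change from earlier replacements

        for span in spans:
            start, end, label = span["start"], span["end"], span["label"]
            replacement = name if label == target_entity else text[start:end]

            pieces.append(text[cursor:start])  # untouched text before the span
            pieces.append(replacement)

            new_start = start + shift
            new_spans.append({
                "start": new_start,
                "end": new_start + len(replacement),
                "label": label,
                "ngram": replacement,
            })

            shift += len(replacement) - (end - start)
            cursor = end

        pieces.append(text[cursor:])  # tail after the last span
        results.append({"text": "".join(pieces), "spans": new_spans})

    return results


data = {
    "text": "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
    "spans": [
        {"start": 0, "end": 5, "label": "person_name", "ngram": "Mohan"},
        {"start": 28, "end": 33, "label": "person_name", "ngram": "Mohan"},
        {"start": 13, "end": 26, "label": "date", "ngram": "25th dec 1980"},
    ],
}

out = replace_entities(data, "person_name", ["Jon", "Sam"])
for sample in out:
    print(sample["text"])
    print([(s["start"], s["end"], s["label"]) for s in sample["spans"]])
```

Because each output is built from the original text and the original spans, nothing is modified while it is being iterated, and each name starts from a clean copy.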
JonSG