Altering string using a list of dictionaries

Question

Background

I am using NeuroNER http://neuroner.com/ to label text data sample_string as seen below.

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'

Output (using NeuroNER)

My output is a list of dictionaries, dic_list:

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},    
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'}]

Legend

id = text ID

type = type of text being identified

start = starting position of identified text

end = ending position of identified text

text = text that is identified
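A quick way to sanity-check the legend, assuming `end` is an inclusive index (which is what the offsets above suggest): slicing `sample_string[start:end + 1]` should give back `text` for every entry. The helper name `mismatched_spans` is made up for this sketch.

```python
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'},
]


def mismatched_spans(text, spans):
    """Return the ids of spans whose offsets do not reproduce their 'text'.

    Assumption: 'end' is inclusive, hence the + 1 in the slice.
    """
    return [s['id'] for s in spans if text[s['start']:s['end'] + 1] != s['text']]


# An empty list means every start/end pair lines up with its text.
print(mismatched_spans(sample_string, dic_list))
```

If this prints any ids, the offsets and the string are out of sync, and an index-based replacement would block the wrong characters.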

Goal

Since the location of each text (e.g. Jane) is given by start and end, I would like to replace each text from dic_list with **BLOCK** in my string sample_string.

Desired Output

sample_string = 'Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**'

Question

I have tried the approaches in "Replacing a character from a certain index" and "Edit the values in a list of dictionaries?", but they are not quite what I am looking for.

How do I achieve my desired output?

Please show the actual code you have tried to use and explain what specifically didn't work. — mkrieger1, Jul 11 '19 at 14:39
note: the start and end don't seem to match the length of "text" in some fields, or the position in the file. — Adam.Er8, Jul 11 '19 at 14:44

score 1 · Answer 1 · edited Oct 11 '19 at 17:35

I may be missing something but you can just use .replace():

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

for dic in dic_list:
    sample_string = sample_string.replace(dic['text'], '**BLOCK**')
print(sample_string)

Though a single compiled regex, which replaces every match in one pass, will probably be faster:

import re
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

# Escape each text so characters such as '.' or '+' are matched literally.
pattern = re.compile('|'.join(re.escape(dic['text']) for dic in dic_list))
result = pattern.sub('**BLOCK**', sample_string)
print(result)

Both output:

Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**

This will work unless other parts of the text that shouldn't be replaced also match the parts to replace (e.g. someone named "seen" or something). – Adam.Er8, Jul 11 '19 at 14:42
Yeah, I'll update it. Though if they are just removing sensitive information this should work. It would help if `start` and `end` lined up properly. – Error - Syntactical Remorse, Jul 11 '19 at 14:45
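To make the caveat from the comments concrete, here is a small sketch with a made-up sentence in which a tagged name also occurs as an ordinary word. The sentence and the single-entry `spans` list are invented for illustration; they are not NeuroNER output, and `end` is assumed to be inclusive as in the question.

```python
# Hypothetical input: the name "May" is tagged once (characters 8-10),
# while the month "May" at the end of the sentence is not tagged.
text = 'Patient May was seen in May'
spans = [{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 10, 'text': 'May'}]

# Text-based replacement: every occurrence of the text is blocked.
by_text = text
for span in spans:
    by_text = by_text.replace(span['text'], '**BLOCK**')

# Offset-based replacement: only the tagged occurrence is blocked.
span = spans[0]
by_offset = text[:span['start']] + '**BLOCK**' + text[span['end'] + 1:]

print(by_text)    # Patient **BLOCK** was seen in **BLOCK**
print(by_offset)  # Patient **BLOCK** was seen in May
```

For de-identification, blocking too much may be acceptable; when it is not, the offsets are the safer source of truth.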

score 1 · Accepted Answer · edited Aug 25 '19 at 20:52


If you want a solution based on the start and end indexes, you can use the intervals between the entries in dic_list to work out which parts of the string to keep, then join those parts with **BLOCK**. This assumes the entries are sorted by start and do not overlap.

Try this:

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

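# Slices of the original text to keep: before the first span, between
# consecutive spans, and after the last span ('end' is inclusive, hence + 1).
parts_to_take = (
    [(0, dic_list[0]['start'])]
    + [(first['end'] + 1, second['start']) for first, second in zip(dic_list, dic_list[1:])]
    + [(dic_list[-1]['end'] + 1, len(sample_string))]
)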
parts = [sample_string[start:end] for start, end in parts_to_take]

sample_string = '**BLOCK**'.join(parts)

print(sample_string)

answered Jul 11 '19 at 15:07 by Adam.Er8 · edited Aug 25 '19 at 20:52 by SFC

There is a slightly more readable way to do this, see [this solution](https://repl.it/repls/SelfreliantImpracticalNanotechnology). – Error - Syntactical Remorse Jul 11 '19 at 15:21
@Error-SyntacticalRemorse nice, working in-place on the input string itself. – Adam.Er8 Jul 11 '19 at 15:27
That link may go dead but you can add it to your answer as another option. My answer uses replace and yours does start and end. – Error - Syntactical Remorse Jul 11 '19 at 15:29

SFC · Answer 3 · 2019-10-11T17:33:13.020

Per the suggestion of @Error - Syntactical Remorse, this version edits the string in place using start and end:

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

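offset = 0  # how far earlier replacements have shifted the remaining text
filler = '**BLOCK**'
for dic in dic_list:  # entries must be sorted by 'start'
    sample_string = sample_string[:dic['start'] + offset] + filler + sample_string[dic['end'] + offset + 1:]
    # filler replaces (end - start + 1) characters, so the shift changes by the difference
    offset += dic['start'] - dic['end'] + len(filler) - 1
print(sample_string)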
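A variation on the same idea, offered as a sketch rather than as part of the answer above: if the entries are processed from the highest `start` to the lowest, each replacement only changes text to the right of the spans still to be handled, so no running `offset` is needed. The function name `block_spans` is made up here, and the sketch assumes inclusive `end` values and non-overlapping spans.

```python
def block_spans(text, spans, filler='**BLOCK**'):
    """Replace each span (inclusive 'end') with filler, working right to left."""
    for span in sorted(spans, key=lambda s: s['start'], reverse=True):
        # Characters before span['start'] are untouched by this replacement,
        # so the offsets of the remaining (earlier) spans stay valid.
        text = text[:span['start']] + filler + text[span['end'] + 1:]
    return text


sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'},
]

print(block_spans(sample_string, dic_list))
```

Sorting inside the function also means the input list does not have to arrive in order.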
