Text Substitution Using Python Dictionary in re.sub

Question

I have an input file in the below format:

<ftnt>
<p><su>1</su> aaaaaaaaaaa </p>
</ftnt>
...........
...........
...........
... the <su>1</su> is available in the .........

I need to convert this to the format below, replacing each <su> reference with its footnote text and deleting the whole ftnt block:

"""...
...
... the aaaaaaaaaaa is available in the ..........."""

Please find below the code I have written. First I saved the keys and values in a dictionary, then I tried to replace each value based on its key using groups.

import re
dict = {}
in_file = open("in.txt", "r")
outfile = open("out.txt", "w")
File1 = in_file.read()

infile1 = File1.replace("\n", " ")
for mo in re.finditer(r'<p><su>(\d+)</su>(.*?)</p>',infile1):

     dict[mo.group(1)] = mo.group(2)

subval = re.sub(r'<p><su>(\d+)</su>(.*?)</p>','',infile1)
subval = re.sub('<su>(\d+)</su>',dict[\\1], subval)

outfile.write(subval)

I tried to use the dictionary in re.sub, but I am getting a KeyError. I don't know why this happens. Could you please tell me how to use a dictionary here? I'd appreciate any help.

Use four spaces when formatting code in your question. It's much more legible and doesn't leave spaces in between pieces of code — TerryA, Jan 24 '13 at 11:34
I edited the formatting, please correct if something looks wrong (use the [edit] link under the question). — Lev Levitsky, Jan 24 '13 at 11:43
[Not again](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... — Karl Knechtel, Jan 24 '13 at 12:27

score 2 · Answer 1 · edited Aug 22 '22 at 06:44

Try using a lambda for the second argument to re.sub, rather than a string with backreferences:

subval = re.sub(r'<su>(\d+)</su>', lambda m: dict[m.group(1)], subval)
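
To see why the callable form helps: re.sub expands \1 only inside a replacement string it is handed, so an expression like dict['\\1'] is evaluated by Python first and looks up the literal key \1, which is what raises a KeyError. A function, by contrast, is called once per match and can do the lookup itself. Below is a minimal sketch on an in-memory string; the sample text and the names `footnotes` and `expand` are illustrative assumptions, not part of the original code.

```python
import re

# Sample text standing in for in.txt (hypothetical content).
text = "<ftnt> <p><su>1</su> aaaaaaaaaaa </p> </ftnt> ... the <su>1</su> is available in the ..."

# Collect footnote number -> footnote text; strip() drops the padding spaces.
footnotes = {
    m.group(1): m.group(2).strip()
    for m in re.finditer(r"<p><su>(\d+)</su>(.*?)</p>", text)
}

# Remove the whole <ftnt> blocks (non-greedy, so each block is matched separately).
body = re.sub(r"<ftnt>.*?</ftnt>", "", text)

def expand(match):
    # Called once per <su>N</su>; unknown numbers are left untouched.
    return footnotes.get(match.group(1), match.group(0))

result = re.sub(r"<su>(\d+)</su>", expand, body)
print(result)
```

Using footnotes.get with the original match as the fallback is a design choice for this sketch: a reference without a matching footnote stays in the text instead of raising an exception.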

answered Aug 22 '22 at 06:31 by Krish · edited Aug 22 '22 at 06:44 by Karl Knechtel

score 0 · Answer 2 · answered Jan 24 '13 at 11:58

First off, don't name a dictionary dict, or you'll shadow the built-in dict type. Second, \\1 doesn't work outside of a string, hence the syntax error. I think the best bet is to take advantage of str.format:

import re

# store the substitutions
subs = {}

# read the data
with open("in.txt", "r") as in_file:
    contents = in_file.read().replace("\n", " ")

# save some regexes for later
# non-greedy, so several <ftnt> blocks are matched one at a time
ftnt_tag = re.compile(r'<ftnt>.*?</ftnt>')
var_tag = re.compile(r'<p><su>(\d+)</su>(.*?)</p>')

# pull the ftnt tags out
ftnt = " ".join(ftnt_tag.findall(contents))
contents = ftnt_tag.sub('', contents)

# pull the su
for match in var_tag.finditer(ftnt):
    # added s so they aren't numbers, useful for format
    # strip() removes the spaces around the footnote text
    subs["s" + match.group(1)] = match.group(2).strip()

# replace <su>1</su> with {s1}
contents = re.sub(r"<su>(\d+)</su>", r"{s\1}", contents)

# now that the <su> are the keys, we can just use str.format
# (this assumes the text has no literal { or } of its own)
with open("out.txt", "w") as out_file:
    out_file.write(contents.format(**subs))
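
One caveat with the str.format approach: any literal { or } already present in the document is read as a format field, so the call can fail with a KeyError or ValueError. A possible workaround, sketched here under the assumption that the braces should survive unchanged, is to double them before inserting the placeholders. The `sample` text and the `subs` contents are made up for illustration.

```python
import re

# Hypothetical input that already contains literal braces.
sample = "set {x} ... the <su>1</su> is available"
subs = {"s1": "aaaaaaaaaaa"}

# Double existing braces so str.format treats them as literal characters.
escaped = sample.replace("{", "{{").replace("}", "}}")

# Then turn <su>N</su> into {sN} placeholders, as in the answer above.
template = re.sub(r"<su>(\d+)</su>", r"{s\1}", escaped)

result = template.format(**subs)
print(result)
```

The order matters in this sketch: escaping has to happen before the placeholders are created, otherwise the {s1} fields themselves would be doubled and printed literally.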

Thank you so much. This was of great help to me. — Fla-Hyd, Jan 24 '13 at 16:29
