
I have an input file in the below format:

<ftnt>
<p><su>1</su> aaaaaaaaaaa </p>
</ftnt>
...........
...........
...........
... the <su>1</su> is available in the .........

I need to convert this to the format below, replacing each <su> marker with the matching footnote text and deleting the whole <ftnt> block:

"""...
...
... the aaaaaaaaaaa is available in the ..........."""

Here is the code I have written. First I save the keys and values in a dictionary, then I try to replace each marker with the value for its key, using the captured group.

import re
dict = {}
in_file = open("in.txt", "r")
outfile = open("out.txt", "w")
File1 = in_file.read()

infile1 = File1.replace("\n", " ")
for mo in re.finditer(r'<p><su>(\d+)</su>(.*?)</p>',infile1):

     dict[mo.group(1)] = mo.group(2)

subval = re.sub(r'<p><su>(\d+)</su>(.*?)</p>','',infile1)
subval = re.sub('<su>(\d+)</su>',dict[\\1], subval)

outfile.write(subval)

I tried to use the dictionary in re.sub, but I am getting a KeyError. I don't know why this happens. Could you please tell me how to use a dictionary lookup here? I'd appreciate any help.

Lev Levitsky
Fla-Hyd
  • Use four spaces when formatting code in your question. It's much more legible and doesn't leave spaces in between pieces of code – TerryA Jan 24 '13 at 11:34
  • I edited the formatting, please correct if something looks wrong (use the [edit] link under the question). – Lev Levitsky Jan 24 '13 at 11:43
  • [Not again](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... – Karl Knechtel Jan 24 '13 at 12:27

2 Answers


Try using a lambda for the second argument to re.sub, rather than a string with backreferences:

subval = re.sub(r'<su>(\d+)</su>', lambda m: dict[m.group(1)], subval)
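
For illustration, here is a minimal self-contained sketch of the same idea applied to an in-memory string instead of files. The function name `inline_footnotes`, the variable `footnotes` (used instead of `dict`), and the choice to leave unknown markers untouched are assumptions made for this example, not part of the question.

```python
import re

def inline_footnotes(text):
    """Replace each <su>N</su> marker with the text of footnote N."""
    # Collect footnote bodies keyed by their number (kept as strings).
    footnotes = {
        m.group(1): m.group(2).strip()
        for m in re.finditer(r'<p><su>(\d+)</su>(.*?)</p>', text, flags=re.DOTALL)
    }
    # Drop the <ftnt> blocks first so their own <su> markers are not substituted.
    body = re.sub(r'<ftnt>.*?</ftnt>\s*', '', text, flags=re.DOTALL)
    # re.sub calls the lambda with each match object; a marker with no
    # matching footnote is left unchanged instead of raising KeyError.
    return re.sub(
        r'<su>(\d+)</su>',
        lambda m: footnotes.get(m.group(1), m.group(0)),
        body,
    )

sample = (
    "<ftnt>\n<p><su>1</su> aaaaaaaaaaa </p>\n</ftnt>\n"
    "... the <su>1</su> is available in the ..."
)
result = inline_footnotes(sample)
print(result)
```

Using `footnotes.get(...)` rather than indexing is a design choice for this sketch: it trades a loud failure for leaving the unmatched marker visible in the output.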
Karl Knechtel
Krish

First off, don't name a dictionary dict, or you'll shadow the built-in dict type. Second, \\1 is only meaningful inside a replacement string; written bare inside dict[...] it is not valid Python, hence the syntax error. I think the best bet is to take advantage of str.format:

import re

# store the substitutions
subs = {}

# read the data
in_file = open("in.txt", "r")
contents = in_file.read().replace("\n", " ")
in_file.close()

# save some regexes for later
ftnt_tag = re.compile(r'<ftnt>.*?</ftnt>')
var_tag = re.compile(r'<p><su>(\d+)</su>(.*?)</p>')

# pull every ftnt block out (non-greedy, so text between blocks is kept)
ftnt = " ".join(ftnt_tag.findall(contents))
contents = ftnt_tag.sub('', contents)

# pull the su
for match in var_tag.finditer(ftnt):
    # prefix keys with "s" so they are valid format field names, not bare numbers;
    # strip the footnote text so no extra spaces are inserted
    subs["s" + match.group(1)] = match.group(2).strip()

# replace <su>1</su> with {s1}
contents = re.sub(r"<su>(\d+)</su>", r"{s\1}", contents)

# now that the <su> are the keys, we can just use str.format
out_file = open("out.txt", "w")
out_file.write( contents.format(**subs) )
out_file.close()
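
One caveat with the str.format route: if the text itself contains literal { or } characters, format will treat them as placeholders and fail. A possible workaround, sketched below, is to double the braces before inserting the {sN} fields. The helper name `fill_placeholders` and the sample text are made up for this example.

```python
import re

def fill_placeholders(text, subs):
    # Double any literal braces first so str.format treats them as plain text.
    escaped = text.replace("{", "{{").replace("}", "}}")
    # Then turn each <su>N</su> marker into a {sN} format field.
    templated = re.sub(r"<su>(\d+)</su>", r"{s\1}", escaped)
    return templated.format(**subs)

filled = fill_placeholders(
    "set {a, b}: the <su>1</su> is here",
    {"s1": "aaaaaaaaaaa"},
)
print(filled)
```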
agoebel