Using regular expression to replace xml in Python

Question

I want to convert the .mdx file to the .dictionary used by the MAC. So, I need to read and replace many lines in a .xml file. But my problem is there are so many different lines to replace. This is some of the xml file that need to be replaced：

<p>@@@LINK=ten pence</p>
<p>@@@LINK=twenty-twenty vision</p>
<p>@@@LINK=fifty pence</p>
<p>@@@LINK=abate</p>

And it will become:

<a href="x-dictionary:d:ten pence:dict_bundle_id">ten pence</a>
<a href="x-dictionary:d:twenty-twenty vision:dict_bundle_id">twenty-twenty vision</a>
<a href="x-dictionary:d:fifty pence:dict_bundle_id">fifty pence</a>
<a href="x-dictionary:d:abate:dict_bundle_id">abate</a>

[Regex is not a suitable tool for parsing XML](https://stackoverflow.com/q/8577060/4934172) or [HTML](https://stackoverflow.com/a/1732454/4934172). Please use an XML parser instead. — 41686d6564 stands w. Palestine, Jul 22 '18 at 08:58

CertainPerformance · Accepted Answer · 2018-07-22T08:32:00.640

0

Capture everything between the = and the </p> in the first group with

<p>@@@LINK=(.+?)</p>

and then you can replace with the desired format via

<a href="x-dictionary:d:\g<1>:dict_bundle_id">\g<1></a>

https://regex101.com/r/zOKlW8/2

edited Jul 22 '18 at 08:32

answered Jul 22 '18 at 08:24

CertainPerformance

356,069
52
309
320

Thanks, it works. I just know re.match(r'xx(xx)', 'xxxx'), then use m.group(0). So, I am confused about how to use other methods to achieve similar functions. It turns out that use \g<1> to solve this problem. – J.H Jul 22 '18 at 10:04

score 0 · Answer 2 · answered Jul 22 '18 at 09:27

I don't think this will be a scalable solution but this is how it could be done -

import re
first_pattern = u'[=].*?[<]' # this is to get the constant out like ten pence, fifty pence etc
second_pattern = u'(@){3}LINK[=]' # this is to match @@@LINK=
third_pattern = u'^[<]p[>]' # to match <p> at start of the string
fourth_pattern = u'[<][\/]p[>]' # to match </p> at the end of the string
replaced_list = []
# I don't know how your data is delimited so I delimited with ",", you can easily make it for readlines
input = "<p>@@@LINK=ten pence</p>,<p>@@@LINK=twenty-twenty vision</p>,<p>@@@LINK=fifty pence</p>,<p>@@@LINK=abate</p>"
# Below are the constants for your strings
constant1 = 'x-dictionary:d:'
constant2 = ':dict_bundle_id">'
constant3 = '<a href="'
constant4 = '</a>'
for line in input.split(","):
    const = re.search(first_pattern, line).group(0).replace("=", "").replace("<", "")
    edited_line = re.sub(second_pattern, constant1+const+consant2, line)
    edited_line = re.sub(third_pattern, constant3, edited_line)
    edited_line = re.sub(fourth_pattern, constant4, edited_line)
    replaced_list.append(edited_line)

OP -

['<a href="x-dictionary:d:ten pence:dict_bundle_id">ten pence</a>', '<a href="x-dictionary:d:twenty-twenty vision:dict_bundle_id">twenty-twenty vision</a>', '<a href="x-dictionary:d:fifty pence:dict_bundle_id">fifty pence</a>', '<a href="x-dictionary:d:abate:dict_bundle_id">abate</a>']

It would be a good idea to parse it with some xml parser

Using regular expression to replace xml in Python

2 Answers2