
I want to convert an .mdx file to the .dictionary format used by macOS. To do that, I need to read an .xml file and replace many of its lines. My problem is that there are so many different lines to replace. Here are some of the lines in the XML file that need to be replaced:

<p>@@@LINK=ten pence</p>
<p>@@@LINK=twenty-twenty vision</p>
<p>@@@LINK=fifty pence</p>
<p>@@@LINK=abate</p>

They should become:

<a href="x-dictionary:d:ten pence:dict_bundle_id">ten pence</a>
<a href="x-dictionary:d:twenty-twenty vision:dict_bundle_id">twenty-twenty vision</a>
<a href="x-dictionary:d:fifty pence:dict_bundle_id">fifty pence</a>
<a href="x-dictionary:d:abate:dict_bundle_id">abate</a>
J.H

2 Answers

0

Capture everything between the = and the </p> in the first group with

<p>@@@LINK=(.+?)</p>

and then you can replace with the desired format via

<a href="x-dictionary:d:\g<1>:dict_bundle_id">\g<1></a>

https://regex101.com/r/zOKlW8/2
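
For reference, the pattern and replacement above could be applied in Python roughly as follows. This is only a minimal sketch: the names LINK_PATTERN, REPLACEMENT, convert_links and the sample string are made up for illustration, and it assumes each link sits on one line in exactly the form shown in the question.

```python
import re

# Pattern from the answer: group 1 captures the text between "=" and "</p>".
LINK_PATTERN = re.compile(r"<p>@@@LINK=(.+?)</p>")

# \g<1> inserts the captured group; the raw string keeps the backslashes intact.
REPLACEMENT = r'<a href="x-dictionary:d:\g<1>:dict_bundle_id">\g<1></a>'


def convert_links(text):
    """Replace every <p>@@@LINK=...</p> occurrence in text (hypothetical helper)."""
    return LINK_PATTERN.sub(REPLACEMENT, text)


# Hypothetical sample input, two of the lines from the question.
sample = "<p>@@@LINK=ten pence</p>\n<p>@@@LINK=abate</p>"
converted = convert_links(sample)
print(converted)
```

Because re.sub replaces every match in the string, the whole file contents can be passed in at once instead of looping over lines.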

CertainPerformance
  • Thanks, it works. I only knew re.match(r'xx(xx)', 'xxxx') followed by m.group(0), so I was confused about how to achieve something similar with other methods. It turns out that \g<1> solves this problem. – J.H Jul 22 '18 at 10:04
0

I don't think this is a scalable solution, but this is how it could be done:

import re

first_pattern = r'[=].*?[<]'  # gets the term with its delimiters, e.g. "=ten pence<"
second_pattern = r'(@){3}LINK[=]'  # matches @@@LINK=
third_pattern = r'^[<]p[>]'  # matches <p> at the start of the string
fourth_pattern = r'[<][/]p[>]'  # matches the closing </p>
replaced_list = []
# I don't know how your data is delimited, so I delimited it with ","; you can easily adapt this to readlines()
input_data = "<p>@@@LINK=ten pence</p>,<p>@@@LINK=twenty-twenty vision</p>,<p>@@@LINK=fifty pence</p>,<p>@@@LINK=abate</p>"
# Below are the constant parts of your output strings
constant1 = 'x-dictionary:d:'
constant2 = ':dict_bundle_id">'
constant3 = '<a href="'
constant4 = '</a>'
for line in input_data.split(","):
    # Extract the term and strip the surrounding "=" and "<"
    const = re.search(first_pattern, line).group(0).replace("=", "").replace("<", "")
    edited_line = re.sub(second_pattern, constant1 + const + constant2, line)
    edited_line = re.sub(third_pattern, constant3, edited_line)
    edited_line = re.sub(fourth_pattern, constant4, edited_line)
    replaced_list.append(edited_line)
print(replaced_list)

Output:

['<a href="x-dictionary:d:ten pence:dict_bundle_id">ten pence</a>', '<a href="x-dictionary:d:twenty-twenty vision:dict_bundle_id">twenty-twenty vision</a>', '<a href="x-dictionary:d:fifty pence:dict_bundle_id">fifty pence</a>', '<a href="x-dictionary:d:abate:dict_bundle_id">abate</a>']

It would be a good idea to parse it with an XML parser instead.
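
As a rough illustration of the parser route, here is a sketch using the standard library's xml.etree.ElementTree. The <entry> wrapper element, the sample data, and the names PREFIX and convert_links_xml are assumptions made for the example; the real dictionary XML will have its own structure (and possibly namespaces) that this does not account for.

```python
import xml.etree.ElementTree as ET

PREFIX = "@@@LINK="  # marker used in the question's <p> elements


def convert_links_xml(xml_text):
    """Turn <p>@@@LINK=term</p> elements into <a href=...>term</a> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("p"):
        text = elem.text or ""
        # Only touch <p> elements that hold just the link marker and no child elements.
        if text.startswith(PREFIX) and len(elem) == 0:
            target = text[len(PREFIX):]
            elem.tag = "a"
            elem.set("href", f"x-dictionary:d:{target}:dict_bundle_id")
            elem.text = target
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the <entry> wrapper is assumed, not taken from the question.
sample_xml = "<entry><p>@@@LINK=ten pence</p><p>@@@LINK=abate</p><p>plain text</p></entry>"
result = convert_links_xml(sample_xml)
print(result)
```

Unlike the regex approach, this leaves ordinary <p> elements untouched by construction and takes care of escaping special characters when the tree is written back out.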

Sushant