
I have a .txt file (containing a kind of XML) that I am trying to restructure. I have two questions about things that are not working the way I want. (Both problems have been solved by Wiktor's comments.)

The file looks like this:

<str name="name">John</str>
<date name="year">2021</date>
<arr name="food">
   <str>Pizza</str>
   <str>Meat</str>
</arr>

I want to restructure this text into this correct XML structure:

<name>John</name>
<year>2021</year>
<food>
   Pizza
   Meat
</food>

To achieve this, I already made a regular expression:

<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>

You can also find the regular expression on Pythex, with a small sample string.

The first question: As you can see on Pythex, the str and date parts are recognized correctly, but the arr part is not. This is because . does not match \n by default. I can enable that with the DOTALL flag, but then the entire file becomes one match, which makes sense. However, I want separate matches, as I get for the str and date parts when DOTALL is not active. How can I make sure that the part between <arr ...> and </arr> is seen as one match, without the search continuing past the </arr>? I need the captures for each individual match, as shown on the right at Pythex. So the match should run from <arr, across any \n, up to </arr>, and stop there.

The second question: I want to use the match captures shown on the right (at Pythex) to assemble the new structure. So I need a method that lets me use those pieces of text from the regular expression in the replacement for the original text. I read that this can be done with the compile method of the re package, but it's not working. This is my code:

from re import compile

file = open("file.txt")
content = file.read()
p = compile('<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>')
p.sub('<\\2>\\3</\\2>', content)

print(content)

The new structure on the p.sub line may not be completely correct, but that's not the problem: if I use p.sub('test', content) and print the content at the end of the code, the matches are not replaced by 'test' either. The content is the same as it was at the beginning. So the entire function doesn't seem to work. What am I doing wrong?

lakeviking
  • [One doesn't simply parse XML with regex](https://stackoverflow.com/a/1732454/770830). – bereal Feb 11 '21 at 19:24
  • `re.sub` returns the new value, so use `content = p.sub('<\\2>\\3</\\2>', content)`. But you need the `r'(?s)<(str|date|arr|int|long)\b.*?="(.+?)">(.*?)</\1>'` regex. See the [regex demo](https://regex101.com/r/nMaT0s/2). – Wiktor Stribiżew Feb 11 '21 at 19:33
  • @bereal It's not an XML file. It's just a .txt file with wrongly structured XML data. But it's a .txt file, so we can treat it as plain text. – lakeviking Feb 11 '21 at 19:33
  • Thanks @WiktorStribiżew, question two has been solved! :) – lakeviking Feb 11 '21 at 19:37
  • Did you try the `p = compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.I | re.S)` + `content = p.sub(r'<\2>\3</\2>', content)`? – Wiktor Stribiżew Feb 11 '21 at 19:45
  • @lakeviking it's well-formed XML except it doesn't have a root element, which is easy to fix. – bereal Feb 11 '21 at 19:53
  • Thanks @WiktorStribiżew, that also worked! Both problems solved now. Thank you very much! – lakeviking Feb 11 '21 at 20:24

2 Answers


You need to make the pattern match across lines by adding the re.S (re.DOTALL) flag, make .* non-greedy by using the lazy dot, .*?, and make sure the closing tag is the same as the opening tag (by means of an inline backreference). Also, do not forget to assign the result of re.sub to a variable, since strings are immutable in Python.
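
To see why greediness matters once the dot can match newlines, here is a minimal sketch. It uses the sample from the question; the variable names (sample, greedy, lazy, replaced) are only illustrative.

```python
import re

# Sample text from the question.
sample = (
    '<str name="name">John</str>\n'
    '<date name="year">2021</date>\n'
    '<arr name="food">\n'
    '   <str>Pizza</str>\n'
    '   <str>Meat</str>\n'
    '</arr>\n'
)

# The question's pattern with DOTALL: greedy .* and .+ run to the last possible tag.
greedy = re.compile(
    r'<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>', re.S
)
# Lazy quantifiers plus a backreference to the opening tag name.
lazy = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.S)

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)
print(len(greedy_matches))  # 1: the whole text is a single match
print(len(lazy_matches))    # 3: one match per top-level element

# sub() returns a new string; the original is left untouched.
replaced = lazy.sub('test', sample)
print(replaced == sample)   # False
```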

You need to use

import re

p = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.I | re.S)
content = p.sub(r'<\2>\3</\2>', content)

See the regex demo.
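
For a complete run on the sample from the question, a sketch along these lines should work. Note that the pattern above leaves the attribute-less `<str>` items inside `<food>` untouched, so this sketch adds a hypothetical extra pass (INNER) to unwrap them, on the assumption that the bare `Pizza` / `Meat` lines in the desired output are wanted. The names TAG, INNER and restructure are mine, not part of the re module.

```python
import re

# Sample text from the question.
sample = (
    '<str name="name">John</str>\n'
    '<date name="year">2021</date>\n'
    '<arr name="food">\n'
    '   <str>Pizza</str>\n'
    '   <str>Meat</str>\n'
    '</arr>\n'
)

# Pattern from this answer: renames an element after its name="..." attribute.
TAG = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.I | re.S)
# Assumption: array items carry no attributes, so they can be unwrapped separately.
INNER = re.compile(r'<(str|date|int|long)>(.*?)</\1>', re.I | re.S)


def restructure(text):
    """Return a new string with array items unwrapped and elements renamed."""
    # Unwrap the items first, so a renamed element can never be mistaken for one.
    unwrapped = INNER.sub(r'\2', text)
    return TAG.sub(r'<\2>\3</\2>', unwrapped)


result = restructure(sample)
print(result)
```

With a file, the same function can be applied to the text read from it, for example `with open("file.txt") as f: content = restructure(f.read())`.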

Details

  • < - a < char
  • (str|date|arr|int|long) - Capturing group 1: any of the alternative substrings
  • \b - a word boundary
  • .*? - zero or more chars (but as few as possible)
  • =" - a =" substring
  • (.*?) - Group 2: any zero or more chars as few as possible
  • "> - a "> substring
  • (.*?) - Group 3: any zero or more chars as few as possible
  • </\1> - </, same value as in Group 1, and a >.
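
The same breakdown can be written into the pattern itself with the re.X (re.VERBOSE) flag, which ignores whitespace and allows comments. This is only an illustrative sketch; PATTERN, sample and groups are names chosen here.

```python
import re

# Sample text from the question.
sample = (
    '<str name="name">John</str>\n'
    '<date name="year">2021</date>\n'
    '<arr name="food">\n'
    '   <str>Pizza</str>\n'
    '   <str>Meat</str>\n'
    '</arr>\n'
)

PATTERN = re.compile(r'''
    <                         # opening angle bracket
    (str|date|arr|int|long)   # group 1: the tag name
    \b                        # word boundary after the tag name
    .*?                       # as few chars as possible, up to the first attribute value
    ="                        # equals sign and opening quote
    (.*?)                     # group 2: the attribute value
    ">                        # closing quote and angle bracket
    (.*?)                     # group 3: the element content, as short as possible
    </\1>                     # closing tag with the same name as group 1
''', re.I | re.S | re.X)

# Collect the three captures of every match.
groups = [m.groups() for m in PATTERN.finditer(sample)]
for tag, name, body in groups:
    print(tag, name, repr(body))
```
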
Wiktor Stribiżew

This site can be helpful with regex: https://www.rexegg.com/regex-conditionals.html

I’m not an expert at regex, but I think adding \n to the pattern is necessary, similar to how you checked for 0+ wildcards.

Edited: <(str|date|arr|int|long).*="(.+)">\n(.*)\n</(str|date|arr|int|long)>

You could try that? Again, I’m no expert on regex. Just trying to lend a helping hand.

JTorres
  • Yes, I already suspected that. I tried it like this: <(str|date|arr|int|long).*="(.+)">(.*\n*)</(str|date|arr|int|long)>, but that didn't help. I also tried your regex, but it doesn't deliver any matches. But the second comment of Wiktor solved it. I can't follow what it does, but it works. Thanks anyway for the help! :) – lakeviking Feb 11 '21 at 20:23