I have a .txt file (with a kind of XML code) that I am trying to restructure. I have 2 questions about things not working the way I want them to. (Both problems have been solved by the comments of Wiktor).
The file looks like this:
<str name="name">John</str>
<date name="year">2021</date>
<arr name="food">
<str>Pizza</str>
<str>Meat</str>
</arr>
I want to restructure this text into this correct XML structure:
<name>John</name>
<year>2021</year>
<food>
Pizza
Meat
</food>
To achieve this, I already made a regular expression:
<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>
You can also find the regular expression HERE, along with a small sample string.
The first question:
As you can see on Pythex, the str and date parts are recognized correctly, but the array part is not. This is because \n is not matched by the . symbol in the regular expression. I can enable this with the dotall parameter, but when I do that, the entire file becomes one match, which makes sense. However, I want separate matches, as happens with the str and date parts when dotall is not active. How can I make sure that the part between <arr ...> and </arr> is seen as a match, without the search continuing past the </arr>? I need the match captures for each individual match that you can see on the right. So the match should start at <arr, include the \n characters, run until </arr>, and then stop.
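A minimal sketch of one way to get separate matches, assuming Python's re module: keep DOTALL on, but make the content group lazy (.*?) and close it with a backreference to the opening tag name (\1), so each match stops at the first matching closing tag. The tightened pattern and the sample variable are assumptions for illustration, not the original expression.

```python
import re

# Sample text taken from the question.
sample = (
    '<str name="name">John</str>\n'
    '<date name="year">2021</date>\n'
    '<arr name="food">\n'
    '<str>Pizza</str>\n'
    '<str>Meat</str>\n'
    '</arr>\n'
)

# [^>]* keeps the attribute part inside the opening tag,
# (.*?) is lazy so it stops at the first closing tag,
# \1 requires the closing tag to have the same name as the opening tag.
pattern = re.compile(
    r'<(str|date|arr|int|long)\b[^>]*="([^"]+)">(.*?)</\1>',
    re.DOTALL,  # lets . match \n, needed for the multi-line arr block
)

matches = [(m.group(1), m.group(2), m.group(3)) for m in pattern.finditer(sample)]
for tag, name, body in matches:
    print(tag, name, repr(body))
```

With DOTALL and a greedy .* the first match would run to the last closing tag in the file; the lazy quantifier is what keeps the matches separate.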
The second question: I want to use the match captures you see on the right (at Pythex) to assemble the new structure. So I need a method that allows me to use those pieces of text from the regular expression to replace the original text. I read that this can be done with the compile method of the re package, but it's not working. This is my code:
from re import compile
file = open("file.txt")
content = file.read()
p = compile('<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>')
p.sub('<\\2>\\3</\\2>', content)
print(content)
The new structure on the p.sub line may not be completely correct, but that's not the problem: if I use p.sub('test', content) and print content at the end of the code, the matches are not replaced by 'test' either. The content is the same as it was at the beginning. So the entire function doesn't seem to work. What am I doing wrong?
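A likely cause, sketched here under the assumption that the goal is the structure shown earlier: Python strings are immutable, so sub returns a new string rather than changing content in place, and the return value has to be kept. The pattern is a tightened variant of the original (lazy .*?, a \1 backreference, DOTALL), and the second pass that strips the inner <str> tags is an assumed extra step, not something from the original code.

```python
import re

# Inline sample instead of reading file.txt, so the sketch is self-contained.
content = (
    '<str name="name">John</str>\n'
    '<date name="year">2021</date>\n'
    '<arr name="food">\n'
    '<str>Pizza</str>\n'
    '<str>Meat</str>\n'
    '</arr>\n'
)

pattern = re.compile(
    r'<(str|date|arr|int|long)\b[^>]*="([^"]+)">(.*?)</\1>',
    re.DOTALL,
)

# sub() does not modify content; it returns the rewritten string.
result = pattern.sub(r'<\2>\3</\2>', content)

# Assumed second pass: drop the attribute-less <str> tags left inside <food>.
result = re.sub(r'</?str>', '', result)

print(result)
```

When working from a file, the same idea applies: read the text, assign the return value of sub to a variable, and print or write that variable rather than the original content.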