Deleting Wordwraps in Python with regex

Question

I want to delete specific word wraps (line breaks) in a file.

The file looks like this:

<Text>
<TextNr>0</TextNr>
<TextStr>AckReq</TextStr>
</Text>
<Text>
<TextNr>1</TextNr>
<TextStr>AckReq</TextStr>
</Text>

After the word wraps have been deleted, it should look like this:

<Text><TextNr>0</TextNr><TextStr>AckReq</TextStr></Text>
<Text><TextNr>1</TextNr><TextStr>AckReq</TextStr></Text>

So every word wrap between <Text> and </Text> should be deleted, and a new line should start after each </Text>. How can I delete these word wraps using a regex?

The regex I have in mind is something like this:

r'<Text>[\r\n]+<TextNr>(\d+)</TextNr>[\r\n]+<TextStr>(\w+)</TextStr>[\r\n]+</Text>[\r\n]+'

I would remind you of the following: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags This question is dangerously close to wanting to parse XML with regex. – PiRocks, Jul 13 '20 at 06:56
What do you mean by "The Regex is something like this:"? Is that what you tried? How did it fail? Please show sample input, the desired output, the output you get, and a description of the relevant differences. Also please explain any other problem you might have with that regex and what you tried to fix it. – Yunnosch, Jul 13 '20 at 06:57
@Yunnosch I don't know how to delete a word wrap; that's my problem. – noggy, Jul 13 '20 at 07:01
Yes, that is *why* you ask. I am trying to help you with improving *how* you ask. You need to provide more information and demonstrate what you have tried yourself. – Yunnosch, Jul 13 '20 at 07:08
I tried some functions to delete the word wraps, but none of them worked. The .py file has a function that stores the text from an XML file in a variable. Next I want to write the function that deletes the word wraps, which I don't have yet. After that, the text with fewer word wraps should be stored in a variable and then written to a file. – noggy, Jul 13 '20 at 07:11

score 1 · Answer 1 · answered Jul 13 '20 at 07:00

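You just need \n(?!<Text>): it matches every newline that is not immediately followed by <Text>, so only the newline in front of each new <Text> record survives. As @PiRocks mentioned in the comments, though, this can get dangerous quickly if your XML gets any more complicated.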

import re

text = """<Text>
<TextNr>0</TextNr>
<TextStr>AckReq</TextStr>
</Text>
<Text>
<TextNr>1</TextNr>
<TextStr>AckReq</TextStr>
</Text>"""

# Remove every newline except those directly followed by an opening <Text> tag
text = re.sub(r"\n(?!<Text>)", "", text)
print(text)

Output:

<Text><TextNr>0</TextNr><TextStr>AckReq</TextStr></Text>
<Text><TextNr>1</TextNr><TextStr>AckReq</TextStr></Text>
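
If only digits should be accepted in <TextNr> and only word characters in <TextStr>, the stricter pattern from the question can be combined with backreferences in re.sub. The following is a minimal sketch under assumptions: the names RECORD and join_records are hypothetical, and using \s* between the tags (to tolerate indentation as well as line breaks) is a choice made here, not something the question specifies. Records that do not match the pattern are left untouched.

```python
import re

# Pattern adapted from the question: digits in <TextNr>, word characters in <TextStr>.
# \s* (an assumption) tolerates line breaks and indentation between the tags.
RECORD = re.compile(
    r"<Text>\s*<TextNr>(\d+)</TextNr>\s*<TextStr>(\w+)</TextStr>\s*</Text>\s*"
)


def join_records(xml_text: str) -> str:
    """Rewrite each matching <Text> record onto one line, followed by a newline."""
    return RECORD.sub(
        r"<Text><TextNr>\1</TextNr><TextStr>\2</TextStr></Text>\n", xml_text
    )


sample = (
    "<Text>\n<TextNr>0</TextNr>\n<TextStr>AckReq</TextStr>\n</Text>\n"
    "<Text>\n<TextNr>1</TextNr>\n<TextStr>AckReq</TextStr>\n</Text>"
)
print(join_records(sample))
```

Unlike the lookahead version, this only touches records that have exactly the expected shape, which makes it stricter but also more brittle if the real file contains other elements.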

Answered by jdaz

What do you mean by "this can get dangerous quickly if your XML gets any more complicated"? This is just a part of my XML file; the real file is much bigger. – noggy Jul 13 '20 at 07:06
Read the comment by PiRocks on your question and follow the link. Try to understand what the most upvoted answer there tries to illustrate: the huge extent of possible and correct XML/HTML. Using regex means making assumptions about simpler input. – Yunnosch Jul 13 '20 at 07:10
"Dangerous" is probably the wrong word in this case since all you're trying to do is remove newlines, but if you try to use regex to find content within arbitrary tags, you are going to have a bad time. Also, my code above will not work properly if you have mismatched tags like `<TextStr>AckReq</TextNr>`, and that is not solvable with regex. – jdaz Jul 13 '20 at 07:20
Okay, what should the regex look like if only a number is allowed in the first element and only a string in the second? Thank you for your answer. – noggy Jul 13 '20 at 07:23
Why do you want to check that, are you trying to verify that the XML is valid? If so, that is exactly the type of thing that regex cannot do properly, you need an XML parser. Because what if you had `BlahString`, or something similar? It gets very complicated very quickly. Look into [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [lxml](https://lxml.de/) or [etree](https://docs.python.org/3/library/xml.etree.elementtree.html) if you are going to be working with a lot of XML or HTML. – jdaz Jul 13 '20 at 07:32
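
As an illustration of the parser route suggested in the comment above, here is a minimal sketch using the standard library's xml.etree.ElementTree. It rests on assumptions: the wrapping <Texts> root element is hypothetical (the question shows only a fragment, and a parser needs a single root), and the function name one_line_per_text is made up for this example. Mismatched tags raise ET.ParseError instead of passing through silently.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <Texts> root element is an assumption, because the
# question shows only a fragment of the real file.
sample = """<Texts>
<Text>
<TextNr>0</TextNr>
<TextStr>AckReq</TextStr>
</Text>
<Text>
<TextNr>1</TextNr>
<TextStr>AckReq</TextStr>
</Text>
</Texts>"""


def one_line_per_text(xml_text: str) -> str:
    """Serialize every <Text> element on its own line, without inner line breaks."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    lines = []
    for text_el in root.iter("Text"):
        # Drop whitespace-only text and tails inside (and right after) the record.
        for el in text_el.iter():
            if el.text is not None and not el.text.strip():
                el.text = None
            if el.tail is not None and not el.tail.strip():
                el.tail = None
        lines.append(ET.tostring(text_el, encoding="unicode"))
    return "\n".join(lines)


print(one_line_per_text(sample))
```

One side effect to be aware of: an element that contains only whitespace is serialized in the short form (for example `<TextStr />`), since its text is cleared before serialization.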
