capture and remove beginning and end xml open and close tag with regex

Question

I have the following XML:

<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</start>
</data>

I am looking to remove the outer <data xmlns=""> and </data> wrapper, as well as the opening and closing tags of each child, from a Beautiful Soup element.

The output should be:

Color: blue green, install: no, days: 4

here is what I've tried:

new = re.sub(r'(/>)</data>.+', '</data>', new)

I'm just learning how to regex, please forgive the noobness.

If available, try using a parser instead of regex. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and https://stackoverflow.com/questions/4071696/python-beautifulsoup-xml-parsing — The fourth bird, Nov 20 '20 at 23:58

score 0 · Accepted Answer · answered Nov 21 '20 at 00:18

As per the comments, the results can be achieved using BeautifulSoup, instead of regex.

All you need to do is find the data tag, then loop through the result of findChildren() (an older alias of find_all()). This lets you capture each child's tag name and text.

For example:

from bs4 import BeautifulSoup

html = '''<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</start>
</data>'''

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
data = soup.find('data')

results = []
for x in data.findChildren():
    results.append(f'{x.name}: {x.text.strip()}')

separator = ', '
print(separator.join(results))

Output

color: blue green, install: No, days: 4
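
If installing BeautifulSoup and lxml is not an option, the same "tag name plus text" idea can be sketched with the standard library's html.parser, which also tolerates the mismatched </start> tag. This is a different technique from the answer above, not part of it; the class name ChildTextCollector, the helper collect_pairs and the wrapper parameter are made up for illustration.

```python
from html.parser import HTMLParser


class ChildTextCollector(HTMLParser):
    """Collect (tag, text) pairs for every tag except the wrapper (hypothetical helper)."""

    def __init__(self, wrapper="data"):
        super().__init__()
        self.wrapper = wrapper
        self.current = None  # name of the most recently opened child tag
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Skip the outer wrapper; remember any other tag that opens.
        if tag != self.wrapper:
            self.current = tag

    def handle_endtag(self, tag):
        # Any closing tag, even a mismatched one such as </start>, ends the current child.
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.pairs.append((self.current, text))


def collect_pairs(markup):
    parser = ChildTextCollector()
    parser.feed(markup)
    parser.close()
    return ", ".join(f"{tag}: {text}" for tag, text in parser.pairs)


sample = '''<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</start>
</data>'''

print(collect_pairs(sample))  # color: blue green, install: No, days: 4
```

This sketch assumes flat children with text directly inside them, as in the question; nested children would need extra bookkeeping.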

Thank you @greg for the advice and the solution! — ApacheOne, Nov 25 '20 at 02:33

score 0 · Answer 2 · answered Nov 21 '20 at 01:03

If you want to use regex to extract the values from your XML, you could use this:

import re

txt = """
<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</start>
</data>
"""
# Each match is a (tag name, text) tuple; '.' does not cross newlines by default.
x = re.findall(r"(?<=<)([^/>]+)>(.+)(?=<)", txt)

result = [f'{name}: {text}' for name, text in x]
print(', '.join(result))

Output:

color: blue green, install: No, days: 4

See https://regex101.com/r/Yqtnkx/1

It doesn't have good performance, but I hope it helps you.
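
As a rough sketch of the same idea with named groups, the variant below may be easier to read. The pattern and the helper name tag_text_pairs are illustrative assumptions, not the answer's own code. It only handles tags without attributes whose text sits on one line, which is why the <data xmlns=""> wrapper is skipped.

```python
import re

# Illustrative pattern: an attribute-free opening tag, then text up to the next '<'.
PAIR = re.compile(r"<(?P<tag>[^/<>\s]+)>(?P<text>[^<>\n]+)(?=<)")


def tag_text_pairs(xml_text):
    """Return 'tag: text' items joined by ', ' (hypothetical helper)."""
    return ", ".join(
        f"{m.group('tag')}: {m.group('text').strip()}"
        for m in PAIR.finditer(xml_text)
    )


sample = """<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</start>
</data>"""

print(tag_text_pairs(sample))  # color: blue green, install: No, days: 4
```

Like any regex approach to markup, this breaks on nested tags, attributes on the children, or text spanning several lines, so a parser is still the safer choice.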
