I have the following XML:

<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</days>
</data>

I want to remove '<data xmlns="">' and '</data>', as well as the opening and closing tags of each child, from a Beautiful Soup element.

The output should be:

Color: blue green, install: no, days: 4

Here is what I've tried:

new = re.sub(r'(/>)</data>.+', '</data>', new)

I'm just learning how to regex, please forgive the noobness.

The fourth bird
ApacheOne
    If available, try using a parser instead of regex. See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and https://stackoverflow.com/questions/4071696/python-beautifulsoup-xml-parsing – The fourth bird Nov 20 '20 at 23:58

2 Answers

As per the comments, the results can be achieved using BeautifulSoup, instead of regex.

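All you need to do is find the data tag, then loop over findChildren(). Note that findChildren() is a legacy alias of find_all(), so it returns every descendant tag, not only the direct children. Each tag it yields exposes its name via .name and its text via .text.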

For example:

from bs4 import BeautifulSoup

xml = '''<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</days>
</data>'''

# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml, 'lxml')
data = soup.find('data')

# Collect "name: text" for every tag nested inside <data>.
results = []
for child in data.findChildren():
    results.append(f'{child.name}: {child.text.strip()}')

print(', '.join(results))

Output:

color: blue green, install: No, days: 4
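
If installing third-party packages is not an option, the same loop can be sketched with the standard library. This is a minimal sketch, not part of the original answer: it assumes the input is well-formed XML, with every child closed by its matching tag (for example </days> for <days>), because xml.etree.ElementTree raises ParseError otherwise. The helper name summarize is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the input is well-formed, so <days> is closed by </days>.
xml_text = '''<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</days>
</data>'''


def summarize(xml_string):
    """Return 'tag: text' pairs for the direct children of the root element."""
    root = ET.fromstring(xml_string)
    # Iterating over an element yields only its direct children.
    return ', '.join(
        f'{child.tag}: {(child.text or "").strip()}' for child in root
    )


print(summarize(xml_text))
```

Unlike findChildren(), iterating over the root element visits only direct children, which is enough for flat data like this.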
Greg

If you want to use regex to extract the values from your XML, you could use this:

import re

txt = """
<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</days>
</data>
"""
# Capture each tag name and the text that follows it, up to the next '<'.
matches = re.findall(r"(?<=<)([^/>]+)>(.+)(?=<)", txt)

result = []
for name, text in matches:
    result.append(name + ': ' + text)
print(', '.join(result))

Output:

color: blue green, install: No, days: 4

See https://regex101.com/r/Yqtnkx/1

It is not very efficient, but I hope it helps.
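
A variant of the same idea, offered only as a sketch: precompile the pattern once and use named groups so the loop does not rely on tuple indexes. The names PAIR and extract_pairs are hypothetical, and the pattern is a simplification that assumes flat tags without attributes, with the text on a single line.

```python
import re

txt = """<data xmlns="">
<color>blue green</color>
<install>No</install>
<days>4</days>
</data>"""

# <tag>text< : a tag name without '/', '>' or whitespace, then text up to
# the next '<'. The closing tag's name is not checked at all.
PAIR = re.compile(r"<(?P<tag>[^/>\s]+)>(?P<text>[^<\n]+)<")


def extract_pairs(text):
    """Return (tag, text) tuples for every simple <tag>text< occurrence."""
    return [(m.group("tag"), m.group("text").strip())
            for m in PAIR.finditer(text)]


pairs = extract_pairs(txt)
print(', '.join(f'{tag}: {value}' for tag, value in pairs))
```

The <data xmlns=""> line is skipped because the tag name may not contain whitespace, so tags with attributes never match.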

Heo