How to remove multiple spaces and characters from an XML file using regex with Python?

Question

I have hundreds of lines in an XML file like these two examples:

<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
<settings site_id="moreID321" xmltv_id="More Text">More Text</settings>

I want to use a Python regex to reformat everything inside xmltv_id="HERE": remove the spaces, dashes, and parentheses, and append .xx at the end, so that

xmltv_id="Some text - dummy (2) HH"
xmltv_id="More Text"

becomes

xmltv_id="Sometextdummy2HH.xx"
xmltv_id="MoreText.xx"

How can I do it?

Thanks!

Thanks for your reply. I can use any tool that solves my issue. Just let me know how to do it and I will try it. — Panita, Jun 26 '19 at 06:40
Possible duplicate of [How do I parse XML in Python?](https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) — tres.14159, Jun 26 '19 at 08:21
@tres.14159 Although that post is about parsing XML in Python, it doesn't solve my problem. — Panita, Jun 26 '19 at 08:54

score 1 · Answer 1 · answered Jun 26 '19 at 06:58

Consider the following approach: read and parse the XML, modify the data, then write the XML back out.

import xml.etree.ElementTree as ET

tree = ET.parse('1.xml')

for element in tree.findall('settings'):
    element.set('xmltv_id', element.get('xmltv_id').replace(' ', ''))

tree.write('2.xml')

Original xml 1.xml:

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    <settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
</note>

Modified xml 2.xml:

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    <settings site_id="someID123" xmltv_id="Sometext-dummy(2)HH">Some text - dummy (2) HH</settings>
</note>
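The snippet above only strips spaces. A minimal sketch of how the same parse, modify, write approach could cover the dashes, parentheses, and the .xx suffix from the question, without any regex, is shown below. The helper name clean_ids, the DROP table, and the inline sample string are assumptions made for illustration; they are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Translation table that deletes spaces, dashes, and both parentheses.
DROP = str.maketrans('', '', ' -()')

def clean_ids(xml_text, suffix='.xx'):
    """Return the XML with every settings/@xmltv_id cleaned and suffixed (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # iter() also finds <settings> elements nested deeper than the root's children.
    for element in root.iter('settings'):
        value = element.get('xmltv_id')
        if value is not None:
            element.set('xmltv_id', value.translate(DROP) + suffix)
    return ET.tostring(root, encoding='unicode')

sample = (
    '<note><settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">'
    'Some text - dummy (2) HH</settings></note>'
)
print(clean_ids(sample))
```

Only the attribute value is changed; the element text keeps its spaces and punctuation.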

score 1 · Accepted Answer · answered Jun 26 '19 at 07:01

Regex is never a robust or suitable approach for parsing structured data such as XML/HTML. Use an appropriate parser.

With the xml.etree.ElementTree module and the re.sub function:

import xml.etree.ElementTree as ET
import re

root = ET.parse('yourxml.xml').getroot()
pat = re.compile(r'[\s()-]+')    # regex character class for chars to replace

for el in root.findall('settings[@xmltv_id]'):
    el.set("xmltv_id", pat.sub('', el.get("xmltv_id")) + '.xx')

ET.dump(root)

Sample output:

<main>
  <settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>
  <settings site_id="moreID321" xmltv_id="MoreText.xx">More Text</settings>
</main>

You can easily save the resulting ElementTree to a new file with [ElementTree.write](https://docs.python.org/3.7/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write).
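As a rough sketch of that last step, the code below keeps the ElementTree object returned by ET.parse (rather than only its root) so that write() can be called on it. The function name rewrite_file, the temporary file names, and the sample content are assumptions for the demonstration only.

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

pat = re.compile(r'[\s()-]+')  # whitespace, parentheses, and dashes

def rewrite_file(src_path, dst_path):
    """Clean every settings/@xmltv_id in src_path and save to dst_path (hypothetical helper)."""
    tree = ET.parse(src_path)
    for el in tree.getroot().findall('settings[@xmltv_id]'):
        el.set('xmltv_id', pat.sub('', el.get('xmltv_id')) + '.xx')
    tree.write(dst_path, encoding='utf-8', xml_declaration=True)

# Demo on a throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.xml')
    dst = os.path.join(tmp, 'out.xml')
    with open(src, 'w', encoding='utf-8') as f:
        f.write('<main><settings site_id="moreID321" xmltv_id="More Text">More Text</settings></main>')
    rewrite_file(src, dst)
    with open(dst, encoding='utf-8') as f:
        result = f.read()

print(result)
```

Writing to a separate output file leaves the original untouched, which makes it easy to compare the two before replacing anything.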

score 0 · Answer 3 · answered Jun 26 '19 at 07:02

I don't think you can accomplish this with a single regex in Python. A solution I can think of is something like this:

import re

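def format_line(line):
    # [^"]* keeps the match inside the attribute value even if more attributes follow
    m = re.search(r'(.*xmltv_id=")([^"]*)(".*)', line)
    if m is None:
        return line  # no xmltv_id attribute on this line
    stripped_tag = re.sub(r'[ ()-]', '', m.group(2))
    return f'{m.group(1)}{stripped_tag}.xx{m.group(3)}'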

>>> format_line('<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>')
'<settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>'
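For what it's worth, re.sub also accepts a function as the replacement, so the whole text can be handled in one substitution pass instead of line by line. The sketch below assumes attribute values are always double-quoted and never contain an escaped quote; the names ATTR, DROP, and format_text are made up for this example. It is still text matching rather than XML parsing, so the parser-based answers remain the safer route.

```python
import re

# Captures the attribute prefix, its value, and the closing quote.
ATTR = re.compile(r'(xmltv_id=")([^"]*)(")')
# Characters to strip from the value: whitespace, parentheses, dashes.
DROP = re.compile(r'[\s()-]+')

def format_text(text):
    """Clean every xmltv_id value found in text and append .xx (hypothetical helper)."""
    def fix(match):
        cleaned = DROP.sub('', match.group(2))
        return f'{match.group(1)}{cleaned}.xx{match.group(3)}'
    return ATTR.sub(fix, text)

lines = (
    '<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>\n'
    '<settings site_id="moreID321" xmltv_id="More Text">More Text</settings>\n'
)
print(format_text(lines))
```

Note that running it twice on the same text appends .xx a second time, so it should be applied to the original file only once.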

score -1 · Answer 4 · answered Jun 26 '19 at 06:49

With re:

import re

xmltv_id1="Some text - dummy (2) HH"
xmltv_id2="More Text"

replace_regex = r'\s|[-]|[(]|[)]'

print(re.sub(replace_regex, '', xmltv_id1) + '.xx')
print(re.sub(replace_regex, '', xmltv_id2) + '.xx')


tres.14159


Well, it is an answer, not a free job for someone. The next steps are easy with [lxml](https://lxml.de/), for example. – tres.14159 Jun 26 '19 at 07:08
The next steps have nothing to do with your current steps. You need to start from the correct attribute, 'xmltv_id', which you are ignoring completely. – WiseDev Jun 26 '19 at 07:11
Maybe you can check: https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python – tres.14159 Jun 26 '19 at 08:21
That answer focuses on a specific tag/attribute... which you do not. – WiseDev Jun 26 '19 at 08:22
In Spain there was an old website named RinconDelVago. It had a lot of school assignments, and people just copied the work and did nothing themselves. Do you want free work, or only a solution to which you will add the code for your business? – tres.14159 Jun 26 '19 at 08:28
