3

I have hundreds of lines in an XML file like these two examples:

<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
<settings site_id="moreID321" xmltv_id="More Text">More Text</settings>

I want to format with python regex everything inside of xmltv_id="HERE" without spaces, dashes or parentheses and add at the end .xx

xmltv_id="Some text - dummy (2) HH"
xmltv_id="More Text"

become like this

xmltv_id="Sometextdummy2HH.xx"
xmltv_id="MoreText.xx"

How can I do it?

Thanks!

Panita
  • 33
  • 3

4 Answers4

1

Consider the following approach - read & parse xml, modify data, write xml.

import xml.etree.ElementTree as ET

tree = ET.parse('1.xml')

for element in tree.findall('settings'):
    element.set('xmltv_id', element.get('xmltv_id').replace(' ', ''))

tree.write('2.xml')

Original xml 1.xml:

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    <settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>
</note>

Modified xml 2.xml:

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    <settings site_id="someID123" xmltv_id="Sometext-dummy(2)HH">Some text - dummy (2) HH</settings>
</note>
1

Regex never be a robust and suitable approach when parsing structured data, such as XML/HTML. Use appropriate parsers.

with etree.ElementTree module and re.sub function:

import xml.etree.ElementTree as ET
import re

root = ET.parse('yourxml.xml').getroot()
pat = re.compile(r'[\s()-]+')    # regex character class for chars to replace

for el in root.findall('settings[@xmltv_id]'):
    el.set("xmltv_id", pat.sub('', el.get("xmltv_id")) + '.xx')

ET.dump(root)

Sample output:

<main>
  <settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>
  <settings site_id="moreID321" xmltv_id="MoreText.xx">More Text</settings>
</main>

You may easily save the resulting elementTree into a new file with https://docs.python.org/3.7/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write

RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
0

I don't think you can accomplish this with a single regex in python. A solution I can think of is something like this:

import re

def format_line(line):
    m = re.search('(.*xmltv_id=")(.*)(".*)', line)
    stripped_tag = re.sub(' |-|\(|\)','', m.group(2))
    return f'{m.group(1)}{stripped_tag}.xx{m.group(3)}'
>>> format_line('<settings site_id="someID123" xmltv_id="Some text - dummy (2) HH">Some text - dummy (2) HH</settings>')
'<settings site_id="someID123" xmltv_id="Sometextdummy2HH.xx">Some text - dummy (2) HH</settings>'
Bogsan
  • 631
  • 6
  • 12
-1

With re is:

import re

xmltv_id1="Some text - dummy (2) HH"
xmltv_id2="More Text"

replace_regex = r'\s|[-]|[(]|[)]'

print(re.sub(replace_regex, '', xmltv_id1) + '.xx'))
print(re.sub(replace_regex, '', xmltv_id2) + '.xx'))
tres.14159
  • 850
  • 1
  • 16
  • 26
  • Well, it is a answer, it is not a free job for someone. The next steps are easy with [lxml](https://lxml.de/) for example. – tres.14159 Jun 26 '19 at 07:08
  • The next steps have nothing to do with your current stepts.. You need to to start from the correct attributes 'xmltv_id', which you are ignoring completely – WiseDev Jun 26 '19 at 07:11
  • Maybe you can check: https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python – tres.14159 Jun 26 '19 at 08:21
  • That answer focuses on a specific tag/attribute... which you do not. – WiseDev Jun 26 '19 at 08:22
  • In Spain there was a old website, it named RinconDelVago. this website had a lot of school works and the people only copied the work and did nonething. Do you want a free work or only a solution and you will add the code for your bussines? – tres.14159 Jun 26 '19 at 08:28