
I am trying to change some lines in an XML file. The code below changes the title tag, but I cannot change the desc tag with re and replace. I want to upper-case all characters in the title and desc tags.

foo.xml

<programme start="20200610110000 +0300" stop="20200610114000 +0300" channel="beIN SERIES SCI-FI HD">
    <title lang="tr">Charmed S2 B5</title>
    <category lang="tr">Life Style</category>
    <desc lang="tr">Tür: Fantastik
[the truth about kat and dogs, 2.sezon, 2019] mel ve maggie, kaybolan macy'yi̇ büyü yoluyla bulmaya çalişirken harry, farkli bi̇r metod dener...
1998 yapimi 'charmed' di̇zi̇si̇ni̇n yeni̇den çevri̇mi̇nde cadilik yeteneği̇ne sahi̇p üç kizkardeşi̇n hi̇kayesi̇ kaldiği yerden devam edi̇yor... Her bi̇ri̇ farkli güçlere sahi̇p mel, macy ve maggie'ni̇n doğaüstü kötücül güçlere karşi koyduğu 'charmed'in yeni̇ sezonunu kaçirmayin!</desc>
  </programme>

test.py

import os,re

file = open('foo.xml', 'r', encoding='utf8')
lines = file.readlines()
file.close()
c = open('new.xml', 'w', encoding='utf8')
for line in lines:
    title = re.search('<title lang=".*?">(.*?)<', line, re.IGNORECASE)
    desc = re.search('<desc lang=".*?">([^;]*)<\/desc>', line, re.MULTILINE)
    if title:
        title = title.group(1)
        l = line.replace(title, title.upper())
        c.write(l)

    else:
        if desc:
            desc = desc.group(1)
            n = line.replace(desc, desc.upper())
            c.write(n)
        else:
            c.write(line)

As Alexander Pushkarev suggested, I've changed the code as below, but the new XML file is identical to the original file. What am I missing?

tree = ET.parse('foo.xml')
root = tree.getroot()
for child in root:
    # tree = ET.fromstring(xml_text)
    el = tree.find(".//title")
    el.text = el.text.upper()

    # Look for desc element
    el = tree.find(".//desc")
    el.text = el.text.upper()

tree.write('new.xml')
mammalianps

1 Answer


Using regular expressions to process XML is a bad idea; see "Why is it such a bad idea to parse XML with regex?"
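To see concretely why the line-by-line regex from the question never matches desc, here is a minimal sketch on a shortened, made-up sample (the sample text and the variable names are assumptions for illustration only). The opening and closing desc tags sit on different lines, so no single line contains both.

```python
import re

# Shortened, hypothetical sample with a multi-line <desc>, shaped like foo.xml
xml_text = '''<programme channel="demo">
    <title lang="tr">Charmed S2 B5</title>
    <desc lang="tr">first line
second line</desc>
</programme>'''

# Same idea as the pattern in the question
desc_pattern = r'<desc lang=".*?">([^;]*)</desc>'

# Line by line: no single line holds both <desc ...> and </desc>
per_line_hits = [line for line in xml_text.splitlines()
                 if re.search(desc_pattern, line)]
print(per_line_hits)

# On the whole text the pattern does match, because [^;] also matches newlines
whole = re.search(desc_pattern, xml_text)
print(repr(whole.group(1)))
```

Matching against the whole text works for this sample, but it still depends on the exact layout of the tags, which is why a real XML parser is the safer choice.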

What you probably need is xml.etree.ElementTree:

>>> import xml.etree.ElementTree as ET
>>> xml_text = u'''<programme start="20200610110000 +0300" stop="20200610114000 +0300" channel="beIN SERIES SCI-FI HD">
...     <title lang="tr">Charmed S2 B5</title>
...     <category lang="tr">Life Style</category>
...     <desc lang="tr">Tür: Fantastik
... [the truth about kat and dogs, 2.sezon, 2019] mel ve maggie, kaybolan macy'yi̇ büyü yoluyla bulmaya çalişirken harry, farkli bi̇r metod dener...
... 1998 yapimi 'charmed' di̇zi̇si̇ni̇n yeni̇den çevri̇mi̇nde cadilik yeteneği̇ne sahi̇p üç kizkardeşi̇n hi̇kayesi̇ kaldiği yerden devam edi̇yor... Her bi̇ri̇ farkli güçlere sahi̇p mel, macy ve maggie'ni̇n doğaüstü kötücül güçlere karşi koyduğu 'charmed'in yeni̇ sezonunu kaçirmayin!</desc>
...   </programme>'''
>>> # Now we parse the document
>>> tree = ET.fromstring(xml_text)
>>> # Look for title element
>>> el = tree.find(".//title")
>>> el.text = el.text.upper()
>>> el.text
'CHARMED S2 B5'
>>> # Look for desc element
>>> el = tree.find(".//desc")
>>> el.text = el.text.upper()
>>> el.text
"TÜR: FANTASTIK\n[THE TRUTH ABOUT KAT AND DOGS, 2.SEZON, 2019] MEL VE MAGGIE, KAYBOLAN MACY'Yİ BÜYÜ YOLUYLA BULMAYA ÇALIŞIRKEN HARRY, FARKLI BİR METOD DENER...\n1998 YAPIMI 'CHARMED' DİZİSİNİN YENİDEN ÇEVRİMİNDE CADILIK YETENEĞİNE SAHİP ÜÇ KIZKARDEŞİN HİKAYESİ KALDIĞI YERDEN DEVAM EDİYOR... HER BİRİ FARKLI GÜÇLERE SAHİP MEL, MACY VE MAGGIE'NİN DOĞAÜSTÜ KÖTÜCÜL GÜÇLERE KARŞI KOYDUĞU 'CHARMED'IN YENİ SEZONUNU KAÇIRMAYIN!"
>>> ET.tostring(tree)
b'<programme start="20200610110000 +0300" stop="20200610114000 +0300" channel="beIN SERIES SCI-FI HD">\n    <title lang="tr">CHARMED S2 B5</title>\n    <category lang="tr">Life Style</category>\n    <desc lang="tr">T&#220;R: FANTASTIK\n[THE TRUTH ABOUT KAT AND DOGS, 2.SEZON, 2019] MEL VE MAGGIE, KAYBOLAN MACY\'YI&#775; B&#220;Y&#220; YOLUYLA BULMAYA &#199;ALI&#350;IRKEN HARRY, FARKLI BI&#775;R METOD DENER...\n1998 YAPIMI \'CHARMED\' DI&#775;ZI&#775;SI&#775;NI&#775;N YENI&#775;DEN &#199;EVRI&#775;MI&#775;NDE CADILIK YETENE&#286;I&#775;NE SAHI&#775;P &#220;&#199; KIZKARDE&#350;I&#775;N HI&#775;KAYESI&#775; KALDI&#286;I YERDEN DEVAM EDI&#775;YOR... HER BI&#775;RI&#775; FARKLI G&#220;&#199;LERE SAHI&#775;P MEL, MACY VE MAGGIE\'NI&#775;N DO&#286;A&#220;ST&#220; K&#214;T&#220;C&#220;L G&#220;&#199;LERE KAR&#350;I KOYDU&#286;U \'CHARMED\'IN YENI&#775; SEZONUNU KA&#199;IRMAYIN!</desc>\n  </programme>'
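The same approach can be applied to a file rather than a string. The sketch below is only one possible way to do it: the helper name upper_tags, the tv root element and the sample text are hypothetical, and a temporary directory stands in for the real foo.xml and new.xml. It visits every title and desc element in the document, not just the first one, and writes the result as UTF-8. ET.tostring produces ASCII bytes by default, which is why the output above contains character references such as &#220;.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def upper_tags(src_path, dst_path, tags=("title", "desc")):
    """Upper-case the text of every element whose tag is in `tags`."""
    tree = ET.parse(src_path)
    for el in tree.iter():
        # Skip other tags and elements without text (el.text is None)
        if el.tag in tags and el.text:
            el.text = el.text.upper()
    # Write UTF-8 so non-ASCII characters are not turned into &#...; references
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)

# Hypothetical sample with two programme elements under an assumed <tv> root
sample = '''<tv>
  <programme channel="demo">
    <title lang="tr">Charmed S2 B5</title>
    <desc lang="tr">Tür: dram
second line</desc>
  </programme>
  <programme channel="demo">
    <title lang="tr">another title</title>
    <desc lang="tr">another desc</desc>
  </programme>
</tv>'''

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "foo.xml")
    dst = os.path.join(tmp, "new.xml")
    with open(src, "w", encoding="utf8") as f:
        f.write(sample)
    upper_tags(src, dst)
    with open(dst, encoding="utf8") as f:
        result = f.read()

print(result)
```

If a string is wanted instead of a file, ET.tostring(tree, encoding="unicode") returns a str with the characters kept as they are.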

If there are several title and desc elements, use findall:

>>> import xml.etree.ElementTree as ET
>>> xml_text = u'''<programme start="20200610110000 +0300" stop="20200610114000 +0300" channel="beIN SERIES SCI-FI HD">
...     <title lang="tr">title1</title>
...     <category lang="tr">Life Style</category>
...     <desc lang="tr">desc1</desc>
...     <title lang="tr">title2</title>
...     <category lang="tr">Life Style</category>
...     <desc lang="tr">desc2</desc>
...   </programme>'''
>>> # Now we parse the document
>>> tree = ET.fromstring(xml_text)
>>> els = tree.findall(".//title")
>>> for el in els:
...     el.text = el.text.upper()
...
>>> els = tree.findall(".//desc")
>>> for el in els:
...     el.text = el.text.upper()
...
>>> ET.tostring(tree)
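As a script rather than a REPL session, the findall variant could look roughly like this. The sample is a made-up, shortened document, and the empty desc element is added here as an assumption to show that el.text can be None, in which case calling upper() on it would fail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the last <desc> is empty on purpose
xml_text = '''<programme channel="demo">
    <title lang="tr">title1</title>
    <desc lang="tr">desc1</desc>
    <title lang="tr">title2</title>
    <desc lang="tr">desc2</desc>
    <desc lang="tr" />
</programme>'''

root = ET.fromstring(xml_text)
for tag in ("title", "desc"):
    for el in root.findall(f".//{tag}"):
        if el.text:  # empty elements have text None
            el.text = el.text.upper()

titles = [el.text for el in root.findall(".//title")]
descs = [el.text for el in root.findall(".//desc")]
print(titles)  # ['TITLE1', 'TITLE2']
print(descs)   # ['DESC1', 'DESC2', None]
```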
Alexander Pushkarev