
Using Python 2.7 and lxml, how do I modify XML elements with multiple values?

E.g.

    <Title>
      <Playcount>1</Playcount>
      <Genre>Adventure</Genre>
      <Genre>Comedy</Genre>
      <Genre>Action</Genre>
    </Title>

It is straightforward to modify Playcount, as it has a single value. How do I modify Genre, which has multiple values?

e.g:

  1. How do I delete all but the first genre?

  2. How do I add a genre?

  3. How do I modify all Baseball genre to Sports?

Thanks.

Imagine
  • All questions point to one answer when it comes to XML modification/transformation: **XSLT** (the special-purpose declarative language designed specifically to manipulate xml docs). And Python's lxml can adequately process xsl scripts. – Parfait Mar 14 '16 at 00:02
  • 1
    Your question will be easier to deal with if you think of it in more technically correct terms. Playcount has content, which is its value. Genres has no content, thus no value. All it has are children, which will be represented as a sequence. Each child is a Genre element which has a single value, its content. So you'd locate the Genres element and then iterate over or access elements of its child-element sequence, manipulating each one as appropriate for your task. Look at [the `lxml` documentation](http://lxml.de/index.html#documentation) for how to do this. – Todd Knarr Mar 14 '16 at 00:08

2 Answers

2

Like this:

from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.fromstring("""<Title>
    <Playcount>1</Playcount>
     <Genre>Adventure</Genre>
     <Genre>Comedy</Genre>
     <Genre>Action</Genre>
     <someTag>Text</someTag>
    </Title>""", parser=parser)

New playcount:

playcount = tree.find('Playcount')
playcount.text = "2"

Delete genres (not first):

title = tree.xpath('/Title')[0]

# Select every Genre after the first one and remove it from its parent
for element in title.xpath('Genre[position() > 1]'):
    title.remove(element)

New genre:

genre = etree.Element("Genre")
genre.text = "New Genre"
tree.xpath('/Title/Genre[last()]')[0].addnext(genre)

Result:

print etree.tostring(tree, pretty_print=True)
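
The third task in the question (changing every Baseball genre to Sports) is not covered above. A minimal sketch of one way to do it with lxml follows; the sample data is hypothetical, since the posted XML contains no Baseball genre, and the variable names are chosen only for illustration.

```python
from lxml import etree

# Hypothetical sample data: the posted XML has no Baseball genre,
# so two are included here for illustration.
xml = """<Title>
  <Playcount>1</Playcount>
  <Genre>Adventure</Genre>
  <Genre>Baseball</Genre>
  <Genre>Baseball</Genre>
</Title>"""

parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(xml, parser=parser)

# findall() returns a list of every direct Genre child of the root.
for genre in root.findall('Genre'):
    if genre.text == 'Baseball':
        genre.text = 'Sports'

genre_texts = [g.text for g in root.findall('Genre')]
print(genre_texts)
```

Each Genre element carries its own text, so the loop only reassigns `.text` on the matching elements and leaves the others untouched.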
JRazor
  • Thanks @JRazor. Upon looking my xml, I realized that there is no outer element. All elements are without outer . How would your code change? – Imagine Mar 14 '16 at 02:34
  • No problem @Imagine. – JRazor Mar 14 '16 at 02:36
  • I changed the code in the original question to reflect this. Thanks again. – Imagine Mar 14 '16 at 02:40
  • Works great! One small issue: I have more elements between the elements and . E.g. After all the Value elements, I have SomeOtherElement>Value before . When I run the code, New Genre shows up AFTER SomeOtherElement>Value. Is there a way to keep the new together with the old ? Thanks. – Imagine Mar 14 '16 at 03:23
  • Added after last genre @Imagine – JRazor Mar 14 '16 at 03:41
  • Excellent! One minor issue: There is no newline between the elements in the output. The output looks like: AdventureNew Genre. How do you add a newline after Adventure? – Imagine Mar 14 '16 at 04:10
  • Parse the string or file with a parser, as in my code (the line just under the import). @Imagine – JRazor Mar 14 '16 at 04:26
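
The missing-newline issue discussed in these comments can be reproduced with a small sketch. It assumes the cause is the whitespace-only text nodes kept by the default parser, which stop `pretty_print` from re-indenting; the helper name `add_genre_and_serialize` is hypothetical.

```python
from lxml import etree

xml = "<Title>\n  <Genre>Adventure</Genre>\n</Title>"

def add_genre_and_serialize(parser=None):
    """Parse the sample, add a Genre after the existing one, and serialize."""
    root = etree.fromstring(xml, parser=parser)
    new_genre = etree.Element("Genre")
    new_genre.text = "New Genre"
    # addnext() places the new element directly after the existing Genre.
    root.find("Genre").addnext(new_genre)
    return etree.tostring(root, pretty_print=True, encoding="unicode")

# Default parser: whitespace text nodes are kept.
plain = add_genre_and_serialize()

# remove_blank_text=True drops whitespace-only text nodes while parsing,
# which lets pretty_print lay out the children itself.
stripped = add_genre_and_serialize(etree.XMLParser(remove_blank_text=True))

print(plain)
print(stripped)
```

Comparing the two printed strings shows the new element glued to its neighbor in the first case and placed on its own line in the second.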
1

Consider an XSLT solution when tasked to manipulate original XML files. As just mentioned on this PHP question, XSLT (whose script is a well-formed XML file) is a special purpose, declarative programming language and can handle multiple tasks in one script as illustrated below.

Most general-purpose languages, including Python (lxml module), PHP (xsl extension), Java (javax.xml), Perl (libxml), C# (System.Xml), and VB (MSXML), maintain XSLT 1.0 processors. Various external executable processors such as Xalan and Saxon (the latter of which can run XSLT 2.0 and, recently, 3.0) are also available, and Python can call them with subprocess.call().

The XSLT and Python scripts follow, in that order; the Python script loads the XSLT file. As mentioned above, the XSLT is portable to other languages and platforms.

XSLT script (save as .xsl or .xslt)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM (COPY CONTENT AS IS) -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>  

  <!-- CHANGE PLAYCOUNT -->
  <xsl:template match="Playcount">
    <xsl:copy>newvalue</xsl:copy>
  </xsl:template>

  <!-- EMPTY TEMPLATE TO REMOVE NODES BY POSITION -->
  <xsl:template match="Genre[position() &gt; 1]"></xsl:template>

  <!-- ADD NEW GENRE -->
  <xsl:template match="Title">
    <xsl:copy>
      <xsl:apply-templates/>
      <Genre>new</Genre>
    </xsl:copy>
  </xsl:template>

  <!-- CHANGE BASEBALL GENRE TO SPORTS -->
  <xsl:template match="Genre[.='Baseball']">
    <xsl:copy>Sports</xsl:copy>
  </xsl:template>

</xsl:transform>

Python Script

import lxml.etree as ET

# LOAD XML AND XSLT FILES
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')

# TRANSFORM INTO DOM OBJECT
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUT TO PRETTY PRINT STRING
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))

# SAVE AS FILE
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()

Result (notice all above questions being handled below, except Baseball which was not present in posted data)

<?xml version='1.0' encoding='UTF-8'?>
<Title>
  <Playcount>newvalue</Playcount>
  <Genre>Adventure</Genre>
  <Genre>new</Genre>
</Title>
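
Because the posted data has no Baseball genre, the rename case is not exercised by the result above. The following self-contained sketch tries it with inline strings instead of files. The input document is hypothetical, and the stylesheet is reduced to the identity transform plus one template that matches the Genre element itself.

```python
from lxml import etree

# Reduced stylesheet: identity transform plus a rename template.
xslt_root = etree.XML(b"""<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:strip-space elements="*"/>

  <!-- Copy everything as is -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Replace the text of any Genre whose value is Baseball -->
  <xsl:template match="Genre[.='Baseball']">
    <xsl:copy>Sports</xsl:copy>
  </xsl:template>
</xsl:transform>""")

# Hypothetical input containing a Baseball genre.
doc = etree.XML(
    b"<Title><Playcount>1</Playcount>"
    b"<Genre>Baseball</Genre><Genre>Comedy</Genre></Title>"
)

transform = etree.XSLT(xslt_root)
result = transform(doc)

genres = [g.text for g in result.getroot().findall('Genre')]
print(genres)
```

Matching on `Genre` rather than on its parent keeps the rest of the Title element intact, since only the matched node is rewritten.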
Parfait
  • Thanks Parfait. I have not used XSLT before, but it looks promising. Also, upon looking my xml, I realized that there is no outer element. All elements are without outer . How would your solution change? I changed the code in the original question to reflect this. Thanks again. – Imagine Mar 14 '16 at 03:03
  • No problem. See edit. Simply change `` to `` (the actual parent tag) in xsl script. But I see you went the other route just like the OP in PHP question in above link, using general purpose language to manipulate the XML doc. XSLT is a forgotten, lost art but I love it and will keep campaigning the lost cause! – Parfait Mar 14 '16 at 17:50
  • Thanks @Parfait. I will definitely look into XSLT. Looks like it is a good investment of time to learn it. – Imagine Mar 18 '16 at 14:38