I have an XML file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
    <Review rid="1004293">
        <sentences>
            <sentence id="1004293:0">
                <text>Judging from previous posts this used to be a good place, but not any longer.</text>
                <Opinions>
            </sentence>
            <sentence id="1004293:1">
                <text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
                <Opinions>
            </sentence>
            <sentence id="1004293:2">
                <text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
                <Opinions>
                    <Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
                </Opinions>
            </sentence>
        </sentences>
    </Review>

How can I delete the sentences without opinions and keep only those whose text has an opinion? I would like to get something like this:

<sentences>
        <sentence id="1004293:2">
            <text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
            <Opinions>
                <Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
            </Opinions>
        </sentence>
    </sentences>

3 Answers

I would convert the XML to a dict using a conversion module (see, for example, "How to convert an xml string to a dictionary?"), filter out the nodes that you do not want, and convert the result back to XML.
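
A minimal sketch of that round trip, using only the standard library rather than the module the answer refers to. The dict layout (tag, attrib, text, children) and all function names are assumptions made for illustration, and tail text and comments are not preserved by this conversion.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Convert an Element into a plain dict (hypothetical layout)."""
    return {
        'tag': elem.tag,
        'attrib': dict(elem.attrib),
        'text': (elem.text or '').strip(),
        'children': [element_to_dict(child) for child in elem],
    }

def dict_to_element(node):
    """Rebuild an Element from the layout produced by element_to_dict."""
    elem = ET.Element(node['tag'], node['attrib'])
    if node['text']:
        elem.text = node['text']
    for child in node['children']:
        elem.append(dict_to_element(child))
    return elem

def has_opinion(sentence):
    """True if the sentence has an <Opinions> child with at least one entry."""
    return any(child['tag'] == 'Opinions' and child['children']
               for child in sentence['children'])

def drop_sentences_without_opinions(node):
    """Recursively filter out <sentence> dicts that carry no opinion."""
    node['children'] = [
        drop_sentences_without_opinions(child)
        for child in node['children']
        if child['tag'] != 'sentence' or has_opinion(child)
    ]
    return node

# Shortened sample input, modelled on the question's structure
xml_text = '''<Reviews><Review rid="1"><sentences>
<sentence id="1:0"><text>No opinion here.</text><Opinions/></sentence>
<sentence id="1:1"><text>Rude staff.</text><Opinions><Opinion polarity="negative"/></Opinions></sentence>
</sentences></Review></Reviews>'''

tree_dict = element_to_dict(ET.fromstring(xml_text))
filtered = dict_to_element(drop_sentences_without_opinions(tree_dict))
print(ET.tostring(filtered, encoding='unicode'))
```

For a one-off filter like this, the intermediate dict adds little over editing the tree directly; it is mainly useful if the data is going to be processed as plain Python structures anyway.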

Matthias

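Consider using XSLT, the special-purpose language designed to transform XML documents. Specifically, use the identity transform to copy the document as-is, then add an empty template that matches every sentence with no child elements under Opinions, so that those sentences are dropped from the output.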

XSLT (save as an .xsl file, which is itself an XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO DELETE NODE(S) -->
    <xsl:template match="sentence[text and not(Opinions/*)]"/>

</xsl:stylesheet>

Python (using the third-party module lxml)

import lxml.etree as et 

doc = et.parse('/path/to/Input.xml') 
xsl = et.parse('/path/to/Script.xsl') 

# CONFIGURE TRANSFORMER 
transform = et.XSLT(xsl) 

# TRANSFORM SOURCE DOC 
result = transform(doc) 

# OUTPUT TO CONSOLE 
print(result) 

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Parfait

Using the built-in XML library (ElementTree).

Note: The XML you posted was not valid, so I had to fix it.

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
   <Review rid="1004293">
      <sentences>
         <sentence id="1004293:0">
            <text>Judging from previous posts this used to be a good place, but not any longer.</text>
            <Opinions />
         </sentence>
         <sentence id="1004293:1">
            <text>We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.</text>
            <Opinions />
         </sentence>
         <sentence id="1004293:2">
            <text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
            <Opinions>
               <Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0" />
            </Opinions>
         </sentence>
      </sentences>
   </Review>
</Reviews>
'''

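root = ET.fromstring(xml)

# Process every <sentences> container, so files with several reviews work too
for sentences_root in list(root.iter('sentences')):
    for s in sentences_root.findall('sentence'):
        # Test with "is None": an Element without children is falsy, so a plain
        # truth test on find() cannot tell "missing" from "present but empty"
        if s.find('Opinions/Opinion') is None:
            sentences_root.remove(s)

# encoding='unicode' makes tostring() return str instead of bytes
print(ET.tostring(root, encoding='unicode'))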

Output (attribute order may differ depending on the Python version):

<Reviews>
   <Review rid="1004293">
      <sentences>
         <sentence id="1004293:2">
            <text>They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.</text>
            <Opinions>
               <Opinion category="SERVICE#GENERAL" from="0" polarity="negative" target="NULL" to="0" />
            </Opinions>
         </sentence>
      </sentences>
   </Review>
</Reviews>
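
When the XML lives in a file rather than in a string, the same idea can be wrapped in a small helper. This is only a sketch: the function name filter_review_file, the file names, and the shortened sample data are assumptions for illustration, and a temporary directory stands in for real paths.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def filter_review_file(src_path, dst_path):
    """Copy src_path to dst_path, omitting sentences that have no <Opinion>."""
    tree = ET.parse(src_path)
    removed = 0
    for sentences in list(tree.getroot().iter('sentences')):
        for sentence in sentences.findall('sentence'):
            if sentence.find('Opinions/Opinion') is None:
                sentences.remove(sentence)
                removed += 1
    # xml_declaration=True writes the <?xml ...?> header into the output file
    tree.write(dst_path, encoding='UTF-8', xml_declaration=True)
    return removed

# Hypothetical sample with two reviews; the second has no <Opinions> at all
sample = '''<Reviews>
  <Review rid="1"><sentences>
    <sentence id="1:0"><text>Fine.</text><Opinions/></sentence>
    <sentence id="1:1"><text>Rude.</text><Opinions><Opinion polarity="negative"/></Opinions></sentence>
  </sentences></Review>
  <Review rid="2"><sentences>
    <sentence id="2:0"><text>Okay.</text></sentence>
  </sentences></Review>
</Reviews>'''

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'Input.xml')
    dst = os.path.join(tmp, 'Output.xml')
    with open(src, 'w', encoding='utf-8') as f:
        f.write(sample)
    removed_count = filter_review_file(src, dst)
    kept_ids = [s.get('id') for s in ET.parse(dst).getroot().iter('sentence')]
    with open(dst, encoding='utf-8') as f:
        first_line = f.readline()

print(removed_count, kept_ids)
```

Returning the number of removed sentences is a design choice made here so that the caller can sanity-check the result; it is not required by the filtering itself.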
balderman