-1

I want to remove every line that contains any of the words in the 'xml_lines' list. I created this script:

from pathlib import Path

# Provide relative or absolute file path to your xml file
filename = './.content.xml'
path = Path(filename)

contents = path.read_text()

xml_lines = [
    'first',
    'second',
]

lines = contents.splitlines()

removed_lines = 0

for line in lines:
    for xml_line in xml_lines:
        if xml_line in line:
            lines.remove(line)
            removed_lines += 1
            print(f'Line: "{line.strip()}" has been removed!')

print(f"\n\n{removed_lines} lines have been removed!")

path.write_text(str(lines))

In the end I get a file that does not look like XML. Can anyone help?

Example (before):

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        first="hfdfherbre"
        second="m3d39d93">
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        first="th54b4"
        second="45b45gt45h">
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

If the script finds a line that contains 'first' or 'second', the entire line should be removed:

<?xml version="1.0"?>
<data>
    <country
        name="Liechtenstein"
        >
            <rank updated="yes">2</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
    </country>
    <tiger
        name="Singapore"
        >
            <rank updated="yes">5</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
    </tiger>
    <car
        name="Panama"
        >
            <rank updated="yes">69</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
    </car>
</data>

This is only an example; the entire XML file consists of 9999999 lines...

Parfait
Mag
  • 2
    If you remove arbitrary lines from an XML document it's highly likely that you'll corrupt it. You need to use something that understands XML (e.g., xml.etree) then remove the element(s) from the document using appropriate functions from that module. Then rewrite the file. Also, **never** modify a list while you're iterating over it (unless you like surprises). Give an example of your XML document and what you want to remove – DarkKnight May 29 '23 at 16:04
  • 1
    Show an example of the XML you want to modify. Generally, in XML there's no such thing as "lines" - you might want to remove nodes with a certain name, attribute or value, e.g. first. XML element and attribute names are case-sensitive, as are values. – Pawel May 29 '23 at 16:16
  • share the input xml and explain how should it look after modification and what is the logic of the modification – balderman May 29 '23 at 16:18
  • I added an example. – Mag May 29 '23 at 16:28
  • 1
    It is better to use XSLT for the task. Are you open to it? – Yitzhak Khabinsky May 29 '23 at 17:03
  • 1
    You can look here: https://stackoverflow.com/questions/3593204/how-to-remove-elements-from-xml-using-python I think this will do what you need. – user1200296 May 29 '23 at 17:15
  • 1
    Avoid treating XML as a text file. See [What's so bad about building XML with string concatenation?](https://stackoverflow.com/q/3034611/1422451) Use compliant DOM libraries like Python's `etree` or `lxml`. – Parfait May 29 '23 at 17:38
  • It seems like you specifically want to remove element attributes rather than lines. That the attributes happen to be on their own lines is meaningless in XML. – Ouroborus May 29 '23 at 19:32
  • Yes, I want to remove all the 'first' and 'second' attributes, whatever values are assigned to them. Furthermore, the attributes are nested in various different elements. The entire XML file consists of 9999999 lines... – Mag May 29 '23 at 20:06
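
The warning in the comments about modifying a list while iterating over it can be reproduced in a few lines. The following is a minimal sketch with made-up sample lines; the `sample` list and the `keep_lines` helper are illustrative assumptions, not part of the original script. It also shows why writing `str(lines)` produces a Python list literal rather than text. Note that even the corrected line filter is not XML-aware, so it only helps when the attributes really do sit on their own lines.

```python
sample = ['<a', 'first="1"', 'second="2"', '>']
words = ['first', 'second']

# Buggy pattern: removing from the list being iterated shifts the
# remaining items left, so the item right after each removal is skipped.
buggy = list(sample)
for line in buggy:
    if any(word in line for word in words):
        buggy.remove(line)

def keep_lines(lines, words):
    """Return a new list without the lines that contain any of the words."""
    return [line for line in lines if not any(w in line for w in words)]

fixed = keep_lines(sample, words)

print(buggy)             # the 'second' line survives because it was skipped
print(fixed)             # both unwanted lines are gone
print(str(fixed))        # a list literal, which is what str(lines) writes
print('\n'.join(fixed))  # joining with newlines gives text again
```

Building a new list instead of mutating the one being iterated avoids the skipped items, and joining with `'\n'` before calling `write_text` keeps the output a text file.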

3 Answers

1

Consider XSLT, the special-purpose language designed to transform XML files. Specifically, an identity template combined with an empty template can remove the unwanted attributes across the entire document without a single for loop. Python's third-party lxml package can run XSLT 1.0 scripts.

XSLT (save as .xsl file, a special XML file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    
    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- EMPTY TEMPLATE TO REMOVE CONTENT -->
    <xsl:template match="@first|@second"/>
</xsl:stylesheet>


Python

import lxml.etree as lx

# PARSE XML AND XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xsl")

# CONFIGURE AND RUN TRANSFORMER
transformer = lx.XSLT(style)
result = transformer(doc)

# OUTPUT TO FILE
result.write_output("Output.xml")
Parfait
  • Good answer, +1 from my side! – Yitzhak Khabinsky May 29 '23 at 21:54
  • I have this kind of error: lxml.etree.XSLTParseError: xsltCompilePattern : failed to compile '@first' – Mag May 30 '23 at 07:00
  • Hmmm...I tested your exact posted XML and my XSLT and did not face any lxml error. What Python version are you running `import sys; print(sys.version)` and lxml version: `print(lxml.__version__)`? – Parfait May 30 '23 at 15:26
0

You could do something simple using XPath and lxml (there may be other ways to do the same):

from lxml import etree

doc = etree.parse("your xml file")

to_drop = ["first", "second"]
# Visit every element once and drop each unwanted attribute if present
for target in doc.xpath('//*'):
    for td in to_drop:
        target.attrib.pop(td, None)
print(etree.tostring(doc).decode())

The output should match your expected output, apart from the whitespace inside the start tags, which the serializer does not preserve.
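
If installing lxml is not an option, the same idea can be sketched with the standard library's `xml.etree.ElementTree`, which the comments on the question also suggest. This is a minimal sketch and not part of the original answer: the `strip_attributes` helper and the inline sample document are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def strip_attributes(root, names):
    """Remove the named attributes from root and every descendant."""
    for elem in root.iter():
        for name in names:
            # pop() with a default does nothing when the attribute is absent
            elem.attrib.pop(name, None)
    return root

# Shortened version of the example from the question
sample = """<data>
    <country
        name="Liechtenstein"
        first="2d2md"
        second="m3d39d93">
            <rank updated="yes">2</rank>
    </country>
</data>"""

root = strip_attributes(ET.fromstring(sample), ["first", "second"])
print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, `ET.parse(path)` followed by `tree.write(path, encoding="utf-8", xml_declaration=True)` would replace the `fromstring`/`tostring` pair. Like the lxml version, this loads the whole document into memory.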

Jack Fleeting
0

For huge XML files you can use iterparse() and remove the attributes while streaming through the document:

import xml.etree.ElementTree as ET

source = "pop_del.xml"
filename = "outfile.xml"
attrib_list = ['first', 'second']

def remove_attribs(elem, keys):
    # Drop the unwanted attributes from elem and all of its descendants
    for node in elem.iter():
        for key in keys:
            node.attrib.pop(key, None)

with open(filename, 'wb') as out:
    # The root element of the example is <data>; adjust this for your file
    out.write(b'<?xml version="1.0"?>\n<data>\n')
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem
        else:
            depth -= 1
            if depth == 1:
                # A direct child of the root is complete:
                # clean it, write it, then free the memory it used
                remove_attribs(elem, attrib_list)
                out.write(ET.tostring(elem))
                root.clear()
    out.write(b'</data>')

Output:

<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E" />
    <neighbor name="Switzerland" direction="W" />
  </country>
  <tiger name="Singapore">
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N" />
  </tiger>
  <car name="Panama">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W" />
    <neighbor name="Colombia" direction="E" />
  </car>
</data>

You can use pop() or del to remove an attribute from an element.
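
The two options behave differently when the attribute is missing, which matters when not every element carries both attributes. A small sketch of the difference, using a made-up single element rather than the full document:

```python
import xml.etree.ElementTree as ET

# Illustrative element that has 'first' but no 'second' attribute
elem = ET.fromstring('<car name="Panama" first="th54b4"/>')

# pop() with a default is safe even when the attribute is absent
removed = elem.attrib.pop("first", None)
missing = elem.attrib.pop("second", None)

# del raises KeyError when the attribute does not exist
try:
    del elem.attrib["second"]
    raised = False
except KeyError:
    raised = True

print(removed, missing, raised, elem.attrib)
```

With `pop(name, None)` no membership check is needed before removing an attribute.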

Hermann12