I have an XML file structured as follows.

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

Given an input file containing segment names, one per line, such as

1
3

the script should remove the segments that have those names. For example, 1 and 3 were given, so the segments named 1 and 3 have been removed:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

with open('file.txt', 'r') as input_file:
    # one list entry per line of the XML file, without the trailing newline
    w_file = [line.rstrip('\n') for line in input_file]

with open('to_delete_nums.txt', 'r') as delete_file:
    # one segment name per line
    d_file = [line.strip() for line in delete_file]

for line in w_file:
    if '<segment name' in line:
        for d in d_file:
            # if the segment name is equal to d, then delete that segment
            pass

How do I accomplish this? I also think having 2 might be unnecessary; is that correct?

1 Answer

Method 1 (with a module):

As @iain-shelvington said, an XML parsing/manipulation library makes this simple and fast.

Try this with the lxml module and XPath:

import lxml.etree as et

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""
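# lxml rejects str input that carries an encoding declaration, so pass bytes
tree = et.XML(xml.encode())
# select the segments to delete; extend the predicate with more names as needed
find_segments = tree.xpath("*//segment[@name='1' or @name='2']")

for each_segment in find_segments:
    # an element can only be removed through its parent
    each_segment.getparent().remove(each_segment)

clean_content = str(et.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(clean_content)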

Some credits to @cédric-julien, @Sheena, @xyz, @josh-allemon and these questions:

  1. how to remove an element in lxml
  2. Using an OR condition in Xpath to identify the same element
  3. lxml.etree.XML ValueError for Unicode string

Method 2 (hard-coded, no module):

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

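lines = []
toggle = True  # True while we are outside a segment that must be dropped
for each_line in xml.splitlines():
    if each_line.strip().startswith("<segment") and ('name="1"' in each_line or 'name="2"' in each_line):
        # opening tag of an unwanted segment: stop keeping lines
        toggle = False
    elif each_line.strip().startswith("</segment>") and not toggle:
        # closing tag of the unwanted segment: resume keeping lines
        toggle = True
    elif toggle:
        lines.append(each_line)

new_xml = "\n".join(lines)
print(new_xml)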
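
The loop above hard-codes the names 1 and 2. Below is a minimal sketch of the same toggle idea that takes the names as a set instead. It assumes, as in the sample, that every opening <segment ...> tag and every closing </segment> tag sits on its own line; drop_segments and SEGMENT_NAME are hypothetical names chosen for illustration.

```python
import re

# Captures the name attribute of an opening <segment ...> tag (assumed layout).
SEGMENT_NAME = re.compile(r'<segment\b[^>]*\bname="([^"]*)"')


def drop_segments(xml_text, names):
    """Return xml_text without the <segment> blocks whose name is in names."""
    names = set(names)
    kept = []
    skipping = False
    for line in xml_text.splitlines():
        stripped = line.strip()
        if skipping:
            # Inside an unwanted segment: drop lines up to and including its end tag.
            if stripped.startswith("</segment>"):
                skipping = False
            continue
        match = SEGMENT_NAME.match(stripped)
        if match and match.group(1) in names:
            skipping = True
            continue
        kept.append(line)
    return "\n".join(kept)


sample = """<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

result = drop_segments(sample, {"1", "3"})
print(result)
```

This is still text matching, not XML parsing, so it breaks as soon as a segment is written on a single line or nested; the parser-based method is the safer choice.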

If you want to read the names from a file, try this:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    xml_data = xml_file.read()

with open('nums.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)

A much shorter version:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    tree = etree.XML(xml_file.read().encode())

with open('nums.txt', 'r') as file:
    list_of_names = list(set(file.read().split("\n")))

xpath = "*//segment[{}]".format(" or ".join(["@name='{}'".format(each_name) for each_name in list_of_names]))

print(xpath)
for each_segment in tree.xpath(xpath):
    each_segment.getparent().remove(each_segment)
new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)
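
One thing to watch for: if the names file ends with a newline, split("\n") yields an empty string, which ends up in the XPath as @name=''. The sketch below filters blank lines out. To keep it free of third-party dependencies it swaps lxml for xml.etree.ElementTree from the standard library, which has no getparent(), so each element is visited as a potential parent instead. parse_names and remove_segments are hypothetical helper names, and the inline data stands in for the two input files.

```python
import xml.etree.ElementTree as ET


def parse_names(text):
    """Split the content of the names file into a set, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_segments(xml_bytes, names):
    """Parse xml_bytes and delete every <segment> whose name is in names."""
    root = ET.fromstring(xml_bytes)
    # Snapshot the elements first so that removing children is safe.
    for parent in list(root.iter()):
        for segment in list(parent.findall("segment")):
            if segment.get("name") in names:
                parent.remove(segment)
    return root


xml_bytes = b"""<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

# A blank line and a trailing newline, as a real names file might contain.
names = parse_names("1\n\n3\n")
root = remove_segments(xml_bytes, names)

remaining = [seg.get("name") for seg in root.iter("segment")]
print(remaining)  # ['2']
print(ET.tostring(root, encoding="unicode"))
```

Passing bytes to the parser and asking tostring for encoding="unicode" keeps the str/bytes handling explicit, which avoids wrapping the serialized result in str().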
DRPK
  • Thank you for your comment! ```@name='1' or @name='2'``` seems like it would be a lot of manual input. Is there a way to automatically read those from a file? In the question, I say that there is already a file containing the names one per line. – Joseph Kars Jan 24 '21 at 07:00
  • @JosephKars; yes wait i will write it. – DRPK Jan 24 '21 at 07:00
  • @JosephKars; check my update and notify me – DRPK Jan 24 '21 at 07:21
  • @JosephKars: updated check again. this one is much shorter than that – DRPK Jan 24 '21 at 07:26
  • Using your last code, I got ```TypeError: str() takes at most 1 argument (2 given)``` – Joseph Kars Jan 24 '21 at 07:50
  • ```*//segment[@name='1' or @name='' or @name='3'] Traceback (most recent call last): File "del_bad_segs.py", line 14, in new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8") TypeError: str() takes at most 1 argument (2 given)``` – Joseph Kars Jan 24 '21 at 07:50
  • @JosephKars: You are on Python 2.x, right? In Python 3.x, str() accepts an encoding as a second argument, but in 2.x it takes only one argument. So remove the comma and encoding="utf-8" and use str(x).encode("utf-8") instead, where x is the former first argument. Let me know. – DRPK Jan 24 '21 at 07:57
  • @JosephKars: See the first line of your error message and look at the second name: it is empty. Do you have errors in your file, such as empty lines? – DRPK Jan 24 '21 at 07:58
  • @JosephKars; if it worked or not notify me – DRPK Jan 24 '21 at 07:59
  • I do not have empty lines and I am using Python 3. It appears that using str() to handle this is not ideal. I will accept your answer regardless. Thank you! – Joseph Kars Jan 24 '21 at 08:01
  • @JosephKars Thanks, you're welcome. Are you sure about your Python version? See the error message TypeError: str() takes at most 1 argument (2 given); as far as I know, Python 3 will not show this error. Check this answer about the str function and its arguments: https://stackoverflow.com/questions/42346984/i-am-getting-this-error-typeerror-str-takes-at-most-1-argument-2-given-at/42347165 – DRPK Jan 24 '21 at 08:04
  • Sorry! Misclick; yes, I'm on Python 2. Sorry for the confusion. I tried your solution. I am getting ```UnicodeEncodeError: 'ascii' codec can't encode characters in position``` – Joseph Kars Jan 24 '21 at 08:10
  • @JosephKars: replace this and notify me: `new_xml = etree.tostring(tree, pretty_print=True, xml_declaration=True).decode("utf-8")` – DRPK Jan 24 '21 at 08:28
  • I got ```UnicodeEncodeError: 'ascii' codec can't encode characters in position 306-307: ordinal not in range(128)``` – Joseph Kars Jan 24 '21 at 19:24
  • @JosephKars add "ignore" argument and try again: new_xml = etree.tostring(tree, pretty_print=True, xml_declaration=True).decode("utf-8", "ignore") – DRPK Jan 24 '21 at 19:28
  • ```File "del_bad_segs.py", line 13, in tree = etree.XML(new_xml.encode()) UnicodeEncodeError: 'ascii' codec can't encode characters in position 306-307: ordinal not in range(128)``` I got this. – Joseph Kars Jan 24 '21 at 19:38
  • @JosephKars: check this https://stackoverflow.com/questions/13765614/lxml-encoding-error-when-parsing-utf8-xml and notify me – DRPK Jan 24 '21 at 19:40
  • @JosephKars: How large is your XML file? You can upload it, and I can check it and write code to correct this, God willing. – DRPK Jan 24 '21 at 19:44