
I have an XML structured like:

<pages>
 <page>
  <textbox>
    <new_line>
     <text>
     </text>
    </new_line>
  </textbox>
 </page>
</pages>

I'm iterating over the text elements that are children of a new_line element in order to merge adjacent tags that have the same size attribute. However, I want to process only new_line elements that are inside a textbox element. I tried adding a for loop to my code, but it doesn't work. Here is the code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over all of them with the current and previous ones
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get size elements
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If the sizes are equal and not None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Add them to current element
            current_text.text = pt + ct
            # Remove previous element
            previous_text.getparent().remove(previous_text)



newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

EDIT:

Sample string:

"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

Expected output:

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>
Anna

1 Answer


You can define a recursive function to handle the nested XML in your case. I wrote a short script for this problem.

import sys
import xml.etree.ElementTree as etree

def add_sub_element(parent, tag, attrib, text=None):
    # Create a child element and set its text only when some text is given
    new_feed = etree.SubElement(parent, tag, attrib)

    if(text):
        new_feed.text = text

    return new_feed


def my_tree_mapper(parent_tag, current, element):

    if(current.tag == 'new_line' and parent_tag == 'textbox'):

        current_size = -1
        current_text = ""

        for child in element:
            child_tag = child.tag
            child_attrib = child.attrib
            child_text = child.text

            if(child_tag == 'text' and 'size' in child_attrib):
                if(child_attrib['size'] == current_size):
                    # For 'text' children with the same size
                    # Append text until we got a different size
                    current_text = current_text + (child_text or "")
                else:
                    if(current_size != -1):
                        # Add sub element into the tree when we got a different size
                        sub_element = add_sub_element(
                            current, child_tag, {'size': current_size}, current_text)

                    current_size = child_attrib['size']
                    current_text = child_text or ""

            else:
                if(current_size != -1):
                    # Or flush the pending 'text' group when we reach an element
                    # that is not a sized 'text'
                    sub_element = add_sub_element(
                        current, 'text', {'size': current_size}, current_text)

                # No logic for different tag
                sub_element = add_sub_element(
                    current, child_tag, child_attrib, child_text)
                my_tree_mapper(current.tag, sub_element, child)

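                current_size = -1
                current_text = ""

        # Flush the last group if the new_line ends with a sized 'text' element
        if(current_size != -1):
            add_sub_element(current, 'text', {'size': current_size}, current_text)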
    else:
        # No logic if not satisfy the condition
        for child in element:
            child_tag = child.tag
            child_attrib = child.attrib
            child_text = child.text

            sub_element = add_sub_element(
                current, child_tag, child_attrib, child_text)
            my_tree_mapper(current.tag, sub_element, child)


the_input = """<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

tree = etree.ElementTree(etree.fromstring(the_input))
root = tree.getroot()
new_root = etree.Element(root.tag, root.attrib)

my_tree_mapper('', new_root, root)
print(etree.tostring(new_root))
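
As an alternative sketch (not the recursive approach above), the "new_line must be inside textbox" restriction can also be written directly as a path expression. The example below uses the standard library's ElementTree path `.//textbox/new_line` and merges adjacent text elements in place; with lxml, as in the question, the equivalent XPath would be `//textbox/new_line`. The function name `merge_same_size_runs` and the shortened sample (including the extra new_line placed outside any textbox) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the second new_line is NOT inside a textbox.
SAMPLE = """<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.482">T</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
            </new_line>
        </textbox>
        <new_line>
            <text size="1">x</text>
            <text size="1">y</text>
        </new_line>
    </page>
</pages>"""


def merge_same_size_runs(root):
    """Merge consecutive <text> children that share a 'size', in place."""
    # Matches only new_line elements whose direct parent is a textbox.
    for new_line in root.findall('.//textbox/new_line'):
        previous = None
        # findall returns a list, so removing children while looping is safe.
        for current in new_line.findall('text'):
            size = current.get('size')
            if (previous is not None and size is not None
                    and previous.get('size') == size):
                # Same size as the kept element: append the text, drop the node.
                previous.text = (previous.text or '') + (current.text or '')
                new_line.remove(current)
            else:
                # Different (or missing) size: this element starts a new run.
                previous = current
    return root


root = merge_same_size_runs(ET.fromstring(SAMPLE))
merged = [(t.get('size'), t.text)
          for t in root.findall('.//textbox/new_line/text')]
print(merged)
# [('12.482', 'C'), ('12.333', 'AP'), ('12.482', 'T'), (None, None), ('12.482', 'II')]
```

The new_line that sits directly under page is left untouched, because the path only selects new_line elements whose parent is a textbox.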

Hope this helps, or at least gives you some ideas.

(In case you want to read more: see the documentation and examples on recursive functions, and the documentation of the XML etree methods.)

Hnampk
  • Thank you, this is very useful! One thing, though: the output I get is not pretty-printed. How can I solve this problem? – Anna Apr 28 '20 at 09:30
  • There are several ways to beautify the XML. You can use the lxml library. For more options, I found this post (https://stackoverflow.com/questions/749796/pretty-printing-xml-in-python/10133365) – Hnampk Apr 28 '20 at 10:02
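
For the pretty-printing question in the comments, here is a minimal sketch that stays with the standard library, assuming Python 3.9 or newer, where `xml.etree.ElementTree.indent` is available. The `compact` string is a made-up stand-in for the unindented output. With lxml, passing `pretty_print=True` to `etree.tostring`, as the question's own code already does, serves the same purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical compact XML, standing in for output that is not pretty-printed.
compact = ('<pages><page><textbox><new_line>'
           '<text size="12.333">API</text><text/>'
           '</new_line></textbox></page></pages>')

root = ET.fromstring(compact)

# indent() rewrites whitespace-only text/tail in place (Python 3.9+).
ET.indent(root, space="    ")

# encoding="unicode" returns a str instead of bytes.
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Each nesting level is indented by four spaces here; the `space` argument controls the indentation string.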