I have an XML structured like this:

"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

I'm iterating over the text elements that are children of each new_line element, in order to merge adjacent tags that have the same size attribute. I also want to require that the new_line element is inside a textbox element, so I want to iterate over textbox as well. I tried adding a for loop to my code, but it doesn't work. Here is the code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over consecutive (previous, current) pairs of text elements
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get size elements
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If the sizes are equal and not None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Add them to current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)



newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

My expected output:

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>

Right now my code doesn't work: it joins one tag and then skips the next one. I think the problem is that I'm not specifying textbox.

Anna
    You are asking many similar questions. How is this question different from https://stackoverflow.com/q/61444282/407651? The code is the same in both questions. – mzjn Apr 27 '20 at 08:47

1 Answer

Although your question is similar to the previous one, this time the problem is simpler and clearer. You can extract the data first, and then assemble it into the format you want. Here's an example.

from simplified_scrapy import SimplifiedDoc, req, utils
xml = """
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""
doc = SimplifiedDoc(xml)
new_line = doc.new_line
lastSize = None
lst = []
texts = ""
for t in new_line.texts:
    if not lastSize or t.size==lastSize:
        texts += t.text
        lastSize = t.size
    else:
        lst.append((lastSize,texts))
        texts = t.text
        if t.size:
            lastSize = t.size
        else: 
            lst.append("<text />")
            lastSize=None
print(lst)

Result:

[('12.482', 'C'), ('12.333', 'API'), ('12.482', 'TOLO'), '<text />', ('12.482', 'III'), '<text />']
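
If you would rather get XML back than a list, the same grouping can be sketched with the standard library alone. This is only an illustration under a few assumptions: it swaps in xml.etree.ElementTree and itertools.groupby instead of simplified_scrapy or lxml, the helper name merge_same_size is made up for this example, and the path .//textbox/new_line is what restricts the merge to new_line elements that sit directly inside a textbox.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

XML = """<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>"""


def size_key(element):
    """Grouping key: the size of a <text> element, None for anything else."""
    return element.get("size") if element.tag == "text" else None


def merge_same_size(root):
    """Merge runs of adjacent <text> siblings that share a size attribute.

    Hypothetical helper for this example. Only <new_line> elements that are
    direct children of a <textbox> are processed.
    """
    for new_line in root.findall(".//textbox/new_line"):
        children = list(new_line)
        merged = []
        for size, run in groupby(children, key=size_key):
            run = list(run)
            if size is None:
                # Elements without a size are kept as they are, one by one.
                merged.extend(run)
                continue
            first = run[0]
            # Concatenate the text of the whole run into its first element.
            first.text = "".join(e.text or "" for e in run)
            # Keep the whitespace that followed the last element of the run.
            first.tail = run[-1].tail
            merged.append(first)
        # Replace the old children with the merged ones.
        for child in children:
            new_line.remove(child)
        new_line.extend(merged)


root = ET.fromstring(XML)
merge_same_size(root)
print(ET.tostring(root, encoding="unicode"))
```

Grouping whole runs at once avoids removing elements from the list you are still comparing pairwise, and a new_line that is not a direct child of a textbox is left untouched.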
dabingsou