
I have a long XML structured like this:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
<new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">s</text>
          <text font="NUMPTY+ImprintMTnum" bbox="284.964,553.628,290.760,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">v</text>
          <text font="NUMPTY+ImprintMTnum" bbox="290.382,553.628,295.477,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">e</text>
          <text font="NUMPTY+ImprintMTnum" bbox="295.333,553.628,301.707,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">n</text>
          <text font="NUMPTY+ImprintMTnum" bbox="301.563,553.628,305.390,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">t</text>
          <text font="NUMPTY+ImprintMTnum" bbox="305.245,553.628,311.620,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">u</text>
          <text font="NUMPTY+ImprintMTnum" bbox="311.475,553.628,315.992,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">r</text>
          <text font="NUMPTY+ImprintMTnum" bbox="315.847,553.628,320.942,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="320.798,553.628,324.625,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">]</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="331.445,553.639,337.241,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">p</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="337.097,553.639,340.924,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">s</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="340.312,553.639,343.560,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">.</text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="343.416,553.639,346.319,566.366" colourspace="DeviceGray" ncolour="0" size="12.727"> </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">m</text>
          <text font="NUMPTY+ImprintMTnum" bbox="365.139,553.628,368.387,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">i</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">-</text>
        </new_line>
</textline>
    </textbox>
</page>
</pages>

The actual XML is way longer and has more pages.

You can see that the "size" attribute takes different values. I want to join the letters of consecutive text elements within a <new_line> element that have the same size, keeping their order of appearance.

My expected output is an XML file:

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
<new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="280.592,553.628,285.109,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura ] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="327.763,553.639,331.590,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text> 
          <text font="NUMPTY+ImprintMTnum" bbox="355.660,553.628,365.283,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">mi-</text>
        </new_line>
</textline>
    </textbox>
</page>
</pages>

Importantly, the order of the characters has to be preserved. I have tried many approaches without success. How can I achieve the desired output?

EDIT: I tried to compare the attributes like this, but I need to keep the tag:

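# Assumed import: the original snippet did not show where ET comes from
import xml.etree.ElementTree as ET

words = []
root = ET.fromstring(xml)  # xml is the string holding the document
pages = root.findall('.//page')
for page in pages:
    previous_key = None
    current_key = None
    texts = page.findall('.//text')
    for txt in texts:
        # Key each character by (font, size), falling back to the previous key
        if previous_key:
            current_key = (txt.attrib.get('font', previous_key[0]), txt.attrib.get('size', previous_key[1]))
        else:
            current_key = (txt.attrib.get('font', 'empty'), txt.attrib.get('size', 'empty'))
        # Start a new group whenever the key changes
        if current_key != previous_key:
            words.append([])
        words[-1].append(txt.text or '')  # empty elements have text None
        previous_key = current_key

# Print each group of characters as one string
for group in words:
    if group:
        print(''.join(group))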
Anna

1 Answer


You can try the following approach:

  • Iterate over all new_line elements. For each new_line:
    • Find all child text elements and save them in a list using findall.
    • Iterate over that list in (previous, current) pairs using zip (see this discussion for more details: zip(l[:-1], l[1:])).
    • Get the size of the current and previous elements.
    • If they are equal and not null:
      • Get the current and previous text.
      • Concatenate them and store the result in the current element.
      • Remove the previous element using remove.
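
As a small illustration of the pairing step, here is a sketch of what zip(l[:-1], l[1:]) produces on a plain list. The size values are made up for the example, and the names sizes, pairs and repeats are hypothetical:

```python
# Hypothetical size values, in document order
sizes = ["10.685", "12.482", "12.482", "12.727"]

# Pair each item with its successor: (previous, current)
pairs = list(zip(sizes[:-1], sizes[1:]))

# Indices of the "current" items whose size equals the previous one
repeats = [i + 1 for i, (prev, curr) in enumerate(pairs) if prev == curr]

print(pairs)
print(repeats)
```

Each pair is (previous, current), so a list of n items yields n - 1 pairs, and the first item never appears as "current".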

Code

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)
root = tree.getroot()

# Iterate over every <new_line> block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements that are direct children of the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over them in (previous, current) pairs
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get the size attributes
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not null
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Accumulate the merged text in the current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)


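# Serialize the modified tree and write it to output.xml
newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
newtree = newtree.decode("utf-8")
with open('output.xml', 'w', encoding='utf-8') as f:
    f.write(newtree)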

output.xml

<pages>
  <page id="1" bbox="0.000,0.000,462.047,680.315" rotate="0">
    <textbox id="0" bbox="191.745,592.218,249.042,603.578">
      <textline bbox="191.745,592.218,249.042,603.578">
        <new_line>
          <text font="QKWQNQ+ImprintMTnum-Bold" bbox="272.661,554.072,277.415,564.757" colourspace="DeviceGray" ncolour="0" size="10.685">1</text>
          <text font="NUMPTY+ImprintMTnum" bbox="324.480,553.628,327.384,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">sventura] </text>
          <text font="PYNIYO+ImprintMTnum-Italic" bbox="346.709,553.639,352.505,566.366" colourspace="DeviceGray" ncolour="0" size="12.727">sps. a</text>
          <text font="NUMPTY+ImprintMTnum" bbox="368.242,553.628,372.759,566.110" colourspace="DeviceGray" ncolour="0" size="12.482">mi-</text>
        </new_line>
      </textline>
    </textbox>
  </page>
</pages>

I'll leave it to you to adapt it to process multiple pages!
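
Note that the code above keeps the attributes (e.g. bbox) of the last element of each run, whereas the expected output in the question keeps those of the first. Here is a minimal sketch of a variant that keeps the first element instead. It is not the code above: it swaps lxml for the standard library's xml.etree.ElementTree and replaces the zip pairing with itertools.groupby. The function name merge_runs and the shortened sample document are hypothetical, and only text elements that are direct children of a new_line are considered.

```python
import itertools
import xml.etree.ElementTree as ET


def merge_runs(root):
    """Merge consecutive <text> children of each <new_line> sharing a size."""
    # Snapshot the new_line elements first, since the tree is modified below
    for new_line in list(root.iter("new_line")):
        texts = new_line.findall("text")  # direct children only
        # groupby only groups *consecutive* items, so order is preserved
        for size, run in itertools.groupby(texts, key=lambda t: t.get("size")):
            run = list(run)
            if size is None or len(run) == 1:
                continue
            first, rest = run[0], run[1:]
            # Keep the first element (and its attributes), absorb the others
            first.text = "".join(t.text or "" for t in run)
            for t in rest:
                new_line.remove(t)
    return root


# Hypothetical, shortened sample with two new_line blocks
sample = """<pages><page id="1"><new_line>
<text size="10.685">1</text>
<text size="12.482">s</text>
<text size="12.482">v</text>
<text size="12.727">a</text>
</new_line><new_line>
<text size="12.727">b</text>
<text size="12.482">m</text>
<text size="12.482">i</text>
</new_line></page></pages>"""

root = merge_runs(ET.fromstring(sample))
merged = [(t.get("size"), t.text) for t in root.iter("text")]
print(merged)
```

Because each new_line is processed on its own, the "a" and "b" elements in the sample stay separate even though they share a size: a run never crosses a new_line boundary. Since root.iter visits every page, the same function should also cover documents with several pages.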

Alexandre B.
  • It seems great! But I get this error: new_line_block.remove(previous_text) File "src\lxml\etree.pyx", line 943, in lxml.etree._Element.remove ValueError: Element is not a child of this node. – Anna Apr 17 '20 at 08:13
  • Have a look at the update. I think there are some `<text>` elements not directly below the `<new_line>` elements. There is an intermediate tag like ` ... `. The current solution doesn't care about intermediate elements. – Alexandre B. Apr 17 '20 at 09:03
  • Thank you! that should have been the problem, but now another problem occurs, I get this error: line 25, in previous_text.getParent().remove(previous_text) AttributeError: 'lxml.etree._Element' object has no attribute 'getParent' – Anna Apr 17 '20 at 09:30
  • Thank you again. I just noticed something: the letters should be joined only within <new_line> elements, while your code joins them regardless of where a <new_line> opens and closes. That is, I want to join <text> elements only inside a <new_line> tag until it is closed, then start joining again when a new one opens. How can I solve this? – Anna Apr 18 '20 at 09:03