I am having a hard time generating an .xml file from a Pandas DataFrame. I am using this solution (How do convert a pandas/dataframe to XML?) (sorry, for some reason Stack won't let me link a word to the site), but am trying to add an additional field. The original solution works if I do not include the shape parameters, but I need those values in the .xml file, and I am not sure why I cannot call the function with the extra arguments. In addition, I am having a hard time writing the result out as XML. I searched a few other Stack questions and found that the code chunk below works, but when I open the .xml file I see only four numbers (30, 1, 67, 44), which are the values in the first row of the DataFrame. If I open it in PyCharm, though, I get the "desired" view.

file_handle = open("output.xml", "w")
Q.writexml(file_handle)
file_handle.close()

Code:

print(image_x.shape)
output: (185, 186, 3)

width = image_x.shape[0]
height = image_x.shape[1]
depth = image_x.shape[2]

def func(row, width, height, depth):
    xml = ['<item>']
    shape = [f'<width>{width}</width>\n<height>{height}</height>\n<depth>{depth}</depth>']
    for field in row.index:
        xml.append('  <{0}>{1}</{0}>'.format(field, row[field]))
    xml.append('</item>')
    xml.append(shape)
    return '\n'.join(xml)

xml_file = func(df, width, height, depth)

df:

   xmin  ymin  xmax  ymax
0    30     1    67    44
1    39   136    73   176

Error:

Traceback (most recent call last):
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/PycharmProjects/Augmentation/random_shit.py", line 100, in <module>
    Q = func(df, width, height, depth)
  File "D:/PycharmProjects/Augmentation/random_shit.py", line 95, in func
    xml.append('  <{0}>{1}</{0}>'.format(field, row[field]))
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 0

Desired Output:

<annotations>
  <size>
    <width>185</width>
    <height>186</height>
    <depth>3</depth>
  </size>
  <item>
    <xmin>30</xmin>
    <ymin>1</ymin>
    <xmax>67</xmax>
    <ymax>44</ymax>
  </item>
  <item>
    <xmin>39</xmin>
    <ymin>136</ymin>
    <xmax>73</xmax>
    <ymax>176</ymax>
  </item>
</annotations>

Binx
  • It looks like you are providing the full df to your function, but from your link, it's meant to be applied to each row in the df. – KJDII Jan 18 '21 at 23:48
  • @KJDII Okay, I think I understand. That is why they have `'\n'.join(df.apply(func, axis=1))`? Would I need to create a function within a function? – Binx Jan 18 '21 at 23:53
  • You might be able to do exactly `'\n'.join(df.apply(func, axis=1))`. Maybe: `xml_file = '\n'.join(df.apply(func, axis=1))` – KJDII Jan 18 '21 at 23:56
  • I can do the `'\n'.join(df.apply(func, axis=1))` and set it to a variable such as what you have. But that only works if I do not include the `shape` stuff. Also, the output `.xml` does not save correctly (as mentioned in my question). – Binx Jan 18 '21 at 23:59

2 Answers

Single-liner func:

def func(df, width, height, depth):
    return '<annotations>\n'+f'<width>{width}</width>\n<height>{height}</height>\n<depth>{depth}</depth>\n'+df.apply(lambda row:f'<item>\n<xmin>{row.xmin}</xmin>\n<ymin>{row.ymin}</ymin>\n<xmax>{row.xmax}</xmax>\n<ymax>{row.ymax}</ymax>\n</item>\n',axis=1).str.cat()+'\n</annotations>'

This concatenates strings with + and takes a map-reduce approach to the DataFrame using apply and str.cat(). apply turns each DataFrame row into the string for its <item> tag, and str.cat() then joins those strings into one. (I also renamed the input parameter from row to df.)
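
If you would rather keep the row-wise function from the question, one option is to let apply hand each row to it and to add the size block once, outside the per-row function. The following is only a sketch under a few assumptions: the df and shape values are copied from the question, the names item_xml and to_annotations are made up for illustration, and the values are not XML-escaped.

```python
import pandas as pd

def item_xml(row):
    # Called once per row by df.apply(..., axis=1); `row` is a Series
    # whose index holds the column names (xmin, ymin, xmax, ymax).
    lines = ['  <item>']
    for field in row.index:
        lines.append('    <{0}>{1}</{0}>'.format(field, row[field]))
    lines.append('  </item>')
    return '\n'.join(lines)

def to_annotations(df, width, height, depth):
    # The size block is built once; it does not depend on any row.
    size = (
        '  <size>\n'
        f'    <width>{width}</width>\n'
        f'    <height>{height}</height>\n'
        f'    <depth>{depth}</depth>\n'
        '  </size>'
    )
    # apply returns a Series of strings, one per row, which join consumes.
    items = '\n'.join(df.apply(item_xml, axis=1))
    return f'<annotations>\n{size}\n{items}\n</annotations>'

# Sample data copied from the question.
df = pd.DataFrame({'xmin': [30, 39], 'ymin': [1, 136],
                   'xmax': [67, 73], 'ymax': [44, 176]})
xml_text = to_annotations(df, 185, 186, 3)
print(xml_text)
```

Since xml_text is an ordinary string, it can be saved with f.write(xml_text) on a file opened in text mode; writexml is a method of minidom nodes, not of strings.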

brunoff
  • Thank you so much. Could you please explain what the `lambda` is doing? I have come across this before, but do not understand the documentation. – Binx Jan 19 '21 at 00:20
  • in general terms, `lambda` is the way you can have functions in place, instead of referencing a named function that you have declared elsewhere with `def` – brunoff Jan 19 '21 at 23:45
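
To make the lambda comment concrete, here is a small sketch that is not taken from the answer: the same formatting step written once as a named function and once as an in-place lambda. Plain dicts stand in for DataFrame rows, and all names are illustrative.

```python
# Two stand-in "rows"; in the answer these would come from df.apply.
rows = [{'xmin': 30}, {'xmin': 39}]

def item_xml(row):
    # Named function, declared with def and referenced by name below.
    return f"<xmin>{row['xmin']}</xmin>"

named = list(map(item_xml, rows))

# Same logic, written in place with lambda instead of a separate def.
inline = list(map(lambda row: f"<xmin>{row['xmin']}</xmin>", rows))

print(named == inline)
```

Both calls build the same list; the lambda just saves declaring a function that is used in only one place.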

Because XML is not just plain text, avoid building it with string concatenation, which is a common pet peeve. The solutions in that linked post may not properly escape or encode your data. Recall that XML stands for Extensible Markup Language, which defines a set of rules for encoding documents.
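
A short sketch of the difference, using a made-up label value that contains characters XML reserves. The label field is hypothetical and not part of the question's data.

```python
import xml.etree.ElementTree as et

# Hypothetical value containing characters that XML reserves.
label = "cat & dog <small>"

# String concatenation copies the value verbatim, so the result is not
# well-formed XML and a parser rejects it.
concatenated = f"<item><label>{label}</label></item>"
try:
    et.fromstring(concatenated)
    concat_ok = True
except et.ParseError:
    concat_ok = False

# ElementTree escapes the text when it serializes the element.
item = et.Element("item")
et.SubElement(item, "label").text = label
built = et.tostring(item, encoding="unicode")

# Parsing the serialized element gives back the original value.
round_trip = et.fromstring(built).find("label").text

print(concat_ok)
print(built)
print(round_trip == label)
```

With purely numeric columns like those in the question, concatenation happens to work, but it breaks as soon as a text column is added.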

Therefore, consider using a compliant DOM library like Python's built-in etree or feature-rich, third-party lxml:

import xml.etree.ElementTree as et
# import lxml.etree as et

root = et.Element("annotations")

# image size block, using the same axis order as the question
size = et.SubElement(root, "size")
et.SubElement(size, "width").text = str(image_x.shape[0])
et.SubElement(size, "height").text = str(image_x.shape[1])
et.SubElement(size, "depth").text = str(image_x.shape[2])

# one dict per DataFrame row, e.g. {"xmin": 30, "ymin": 1, ...}
data = df.to_dict(orient='records')

for d in data:
    item = et.SubElement(root, "item")

    for k, v in d.items():
        et.SubElement(item, k).text = str(v)

with open("output.xml", "wb") as f:
    f.write(et.tostring(root, encoding="utf8"))

To pretty print output with line breaks and indentation, use toprettyxml in built-in minidom. Note: lxml.etree has a pretty_print argument in its tostring call.

from xml.dom.minidom import parseString

# ...same code as above except write output

# serialize the tree built above, then let minidom re-indent it
dom = parseString(et.tostring(root).decode("utf-8"))

with open("output.xml", "wb") as f:
    f.write(dom.toprettyxml(encoding="utf8"))
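
As an alternative that stays within ElementTree, Python 3.9 added et.indent, which indents a tree in place. The sketch below is not part of the original approach: it uses a list of dicts in place of df.to_dict(orient='records') so that it runs without pandas, and it hard-codes the shape from the question.

```python
import xml.etree.ElementTree as et

# Stand-ins for image_x.shape and df.to_dict(orient='records').
shape = (185, 186, 3)
records = [
    {"xmin": 30, "ymin": 1, "xmax": 67, "ymax": 44},
    {"xmin": 39, "ymin": 136, "xmax": 73, "ymax": 176},
]

root = et.Element("annotations")

# Size block, in the same axis order the question uses.
size = et.SubElement(root, "size")
for tag, value in zip(("width", "height", "depth"), shape):
    et.SubElement(size, tag).text = str(value)

# One <item> per record, one child element per column.
for record in records:
    item = et.SubElement(root, "item")
    for key, value in record.items():
        et.SubElement(item, key).text = str(value)

# Adds newlines and two-space indentation in place (Python 3.9+).
et.indent(root, space="  ")
pretty = et.tostring(root, encoding="unicode")
print(pretty)
```

On older Python versions et.indent is not available, so the minidom route is the fallback there.
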
Parfait