0

I have multiple XML files (nearly a thousand) with data about X and Y coordinates of billiards balls. I am trying to export them all to Pandas dataframes with the headers being X and Y. Presumably I first open each XML file in the folder (something like this), and then use an XPath expression to extract the coordinates.

They look like this:

<?xml version="1.0" encoding="utf-8"?><?mso-application progid="Excel.Sheet"?><Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
  <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
    <Title>Pagulayan vs Yapp (Last 16) 2019 US Open 9-ball 15</Title>
  </DocumentProperties>
  <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
    <WindowHeight>12270</WindowHeight>
    <WindowWidth>14955</WindowWidth>
    <WindowTopX>720</WindowTopX>
    <WindowTopY>315</WindowTopY>
    <ProtectStructure>False</ProtectStructure>
    <ProtectWindows>False</ProtectWindows>
  </ExcelWorkbook>
  <Styles>
    <Style ss:ID="keyimages-title">
      <Alignment ss:Horizontal="Center" ss:Vertical="Bottom" />
      <Borders>
        <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Right" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1" />
      </Borders>
      <Interior ss:Color="#ccffcc" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="chronos-title">
      <Alignment ss:Horizontal="Center" ss:Vertical="Bottom" />
      <Borders>
        <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Right" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1" />
      </Borders>
      <Interior ss:Color="#99ccff" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="track-title">
      <Alignment ss:Horizontal="Center" ss:Vertical="Bottom" />
      <Borders>
        <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Right" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1" />
      </Borders>
      <Interior ss:Color="#cc99ff" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="data">
      <Borders>
        <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Right" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1" />
      </Borders>
      <Interior ss:Color="#ffffff" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="header">
      <Alignment ss:Horizontal="Center" ss:Vertical="Bottom" />
      <Borders>
        <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Right" ss:LineStyle="Continuous" ss:Weight="1" />
        <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1" />
      </Borders>
      <Interior ss:Color="#C0C0C0" ss:Pattern="Solid" />
    </Style>
  </Styles>
  <Worksheet ss:Name="Pagulayan vs Yapp (Last 16) 2019 US Open 9-ball 15">
    <Table x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="60">
      <Row>
        <Cell>
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:MergeAcross="1" ss:StyleID="keyimages-title">
          <Data ss:Type="String">Key Images</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">Name</Data>
        </Cell>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">Time</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell>
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:MergeAcross="3" ss:StyleID="keyimages-title">
          <Data ss:Type="String">Points</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">Name</Data>
        </Cell>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">X</Data>
        </Cell>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">Y</Data>
        </Cell>
        <Cell ss:StyleID="header">
          <Data ss:Type="String">Time</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 1</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">112.64</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-108.00</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 2</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">70.91</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-166.91</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 3</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-287.45</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-100.64</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 4</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-390.55</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">117.82</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 5</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-15.00</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">127.64</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 6</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-645.82</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">22.09</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 7</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-442.09</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-262.64</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 8</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">348.27</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-218.45</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 9</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">512.73</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-198.82</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">Marker 10</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">544.64</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="Number">-238.09</Data>
        </Cell>
        <Cell ss:StyleID="data">
          <Data ss:Type="String">0:00:00:00</Data>
        </Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>

The XML files are in Excel format, so the X-coordinate is the second Cell of every Row, and the Y-coordinate is the third Cell. The problem is that since all of them are wrapped up in tags with the same name, namely Data, it is not easy to put them into a node.find() or node.attrib.get() function. It would have been easier if all the tags had unique names instead of Cell and Data.

What might be a good way to access those coordinates through XPath?

  • 3
    If you are woking with Excel files, you might want use a library (like [openpyxl](https://openpyxl.readthedocs.io/en/stable), ...) that handles all that low-level stuff for you – Maurice Meyer Jun 08 '20 at 08:38
  • They may be Excel-openable files, but they are after all XML files. Should I use XML-based solutions like XPath, or workbook (XLSX)-based solutions like openpyxl? – ProximanovaMetropolis Jun 08 '20 at 09:58
  • Maybe [this SO post](https://stackoverflow.com/questions/33470130/read-excel-xml-xls-file-with-pandas) helps. – above_c_level Jun 08 '20 at 11:36
  • Using the sample xml in your question, what exactly would be your expected output? – Jack Fleeting Jun 08 '20 at 15:48
  • I have ~1000 XMLs with X and Y coordinates, and I would like to have ~1000 Pandas dataframes with X and Y coordinates, so that I can perform clustering, etc. on each of them. – ProximanovaMetropolis Jun 09 '20 at 06:38

0 Answers0