0

I'm new to python, so i have this XML file which i can open pretty easily using Excel, this is XML file

andi want to convert it to .xlsx or any compatible format to be able to use openpyxl module to it so that i can read it easily, is there any way to do this on python? Any advice would be appreciated thankyou.

Panagiotis Kanavos
  • 120,703
  • 13
  • 188
  • 236
Jolzal
  • 547
  • 7
  • 19
  • not Python but command line: https://stackoverflow.com/questions/30349542/command-libreoffice-headless-convert-to-pdf-test-docx-outdir-pdf-is-not – Tin Nguyen Sep 29 '20 at 11:26
  • `xlsx` *is* XML - it's a ZIP package containing XML documents. Where did this XML file come from? Why not modify the code that generated this to create a proper `xlsx` file? – Panagiotis Kanavos Sep 29 '20 at 11:42
  • @TinNguyen how is this related to this question? The linked question is about using LibreOffice to convert documents. This question asks how to convert what looks like a 2003 XML *Excel* file into the current Excel format – Panagiotis Kanavos Sep 29 '20 at 11:44
  • @PanagiotisKanavos LibreOffice Calc is an open source alternative of Microsoft Excel. You can open Excel XML and Excel XLSX with it and save it in different file formats. Furthermore it has a command line for file conversion. I would link Microsoft Excel directly but from my quick research they don't offer a CLI for that. – Tin Nguyen Sep 29 '20 at 11:48
  • SpreadsheetML was used only briefly between 2003 and 2006 as a stop-gap due to patent problems. It was replaced by `xlsx` in 2006. Very few if any libraries support it directly. Unless you have to read 15-year old files, it's better to use a library like `xlsxwriter` to write proper `xlsx` files – Panagiotis Kanavos Sep 29 '20 at 11:50
  • @TinNguyen Excel offers a full automation API, not just command lines. That's not what this question is about though. Besides, nothing says that Calc can read the Excel 2003 XML format. In fact [there are questions that show this isn't possible](https://ask.libreoffice.org/en/question/144164/unable-to-open-xls-spreadsheetml-xml-spreadsheet-2003-or-xmlss/). – Panagiotis Kanavos Sep 29 '20 at 11:57
  • @PanagiotisKanavos thankyou for helping, it was auto generated like that from my school report system, and i have the task to retrieve the data inside it so i dont think i can modify the code that generated this, so there really is no library supporting this? or should i parse the xml manually? thankyou – Jolzal Sep 29 '20 at 12:09
  • School and government systems *can* be that old. The easiest way is to load it in Excel and save it again as `xlsx`. You can use Excel Automation to drive Excel from Python if you want. This works for one-off conversions or a small number of files. If you don't care about looks, it's actually very easy to parse this XML file - you need to read the `Row` and `Cell` elements to retrieve each row's data, put it in an array and use eg [xlsxwriter](https://xlsxwriter.readthedocs.io/tutorial01.html) to export the array to `xlsx` – Panagiotis Kanavos Sep 29 '20 at 12:20

1 Answers1

0

I had the same problem. This answer helped me Attempting to Parse an XLS (XML) File Using Python

You may save your Excel styled XML as xlsx with Workbook.SaveAs method using win32com (only for Windows users) and read in with pandas.read_excel

import win32com.client
import pandas as pd

original_file = "Your_downloaded_file.xml"
output = "Your_converted_file.xlsx"

xlApp = win32com.client.Dispatch("Excel.Application")
xlWbk = xlApp.Workbooks.Open(original_file)
xlWbk.SaveAs(output, 51)
xlWbk.Close(True)
xlApp.Quit()

output_df = pd.read_excel(output)
print(output_df.columns.ravel())

Kostas Nitaf
  • 428
  • 1
  • 2
  • 12