1

I have an .xls file which I want to pretty print in order to have nice diffing rather than just binary files being changed.

My approach is to unzip this entire thing. The resulting string does not contain linebreaks so I ran it through xmllint --format. But on this seemingly simple path I have encountered several issues which I have already spent hours on:

  1. unzip multiple files inside the xml archive. This results in invalid xml. Even with unzip -q options I get multiple DTDs and so on. xmllint breaks on this without formatting the input.

    unzip -c -a -q myFile.xlsx | xmllint --format -

  2. I tried splitting the XML into an array using read in order to feed each individual xml file to xmllint. In the result of read most array items seem to be empty and the third and fourth item contain 20something letters of the xml string.

    IFS='\<\?xml' read -r -a files <<< "$decompressed"

  3. I also tried just inserting linebreaks with sed but the filesize is so large that processing takes too long for making it feasible for diffing.

    ${decompressed/\>\</\>\n\</g}

I have just run out of ideas so I decided to consult you guys! Thanks ahead :)

Jan Wirth
  • 411
  • 2
  • 13

2 Answers2

1

As you already found out, the XLSX file contains more than one XML file, as specified in the official documentation. In particular, there will be one XML document per Excel sheet, which means just combining them will not be a practical solution.

To make matters worse, you might also have a shared string table in Excel files, which means the sheet files themselves will not contain the original strings ("Hello, World!"), but instead just a reference to the string table (1234).

Depending on your use case, you may want to consider more text-based formats like CSV.

user3151902
  • 3,154
  • 1
  • 19
  • 32
1

I would extract the xslx into multiple files, pretty-print them and then do a recursive diff on the xml files. (there are also binary files that can't be diffed)

Like this:

# Unzip the xlsx files into folders
unzip -aqd foo foo.xlsx
unzip -aqd bar bar.xlsx

# Pretty print all .xml and .rels files
find foo bar \( -name '*.xml' -o -name '*.rels' \) -exec xmllint --format {} --output {} \;

# Now you can recursively diff them
diff -urN foo bar

You can create bash function that wraps the process.

Note: The unpacked xlsx file may contain binary files with .bin ending. You need to exclude them from diff:

diff -urNx '*.bin' foo bar
hek2mgl
  • 152,036
  • 28
  • 249
  • 266