The accepted solution (using strings / unzip) didn't work very well for me on Linux Mint 19.3. The following works reasonably well for most doc/docx/rtf/xls files as well as their LibreOffice counterparts. Some of this might work on Windows via Cygwin or Git Bash, but I have not tested it; if the packages I mention are not available there, I would look for Python/Perl scripts that do the same conversion and substitute those.
- Install prerequisites:
sudo apt install git pandoc catdoc odt2txt
- Note that catdoc and odt2txt include multiple tools for handling doc/xls/ppt/odt/ods/odp formats, not just the ones in the package name. Likewise, pandoc is what I used for the newer zipped 'x' formats, although only docx actually worked for me (see the notes at the end).
- I wanted my attributes to apply globally (i.e. user-scoped) rather than per-project as done in the other answers. To create the user-scoped git attributes file, use
mkdir -p ~/.config/git/ && touch ~/.config/git/attributes
(on Windows this should be mkdir "%USERPROFILE%\.config\git" && type nul > "%USERPROFILE%\.config\git\attributes")
- Set up the git attributes file (either the user-scoped file from the previous step or the project-scoped file ${projectDir}/.git/info/attributes, as desired):
# handle windows *.reg files (utf-16 which git doesn't normally like)
*.reg diff=utf16
# handle misc common document formats
*.pdf diff=pdf2txt
*.rtf diff=catdoc
# handle libre/open document formats
*.ods diff=ods2txt
*.odp diff=odp2txt
*.odt diff=odt2txt
# handle older common ms document formats
# note: ppt did not work for me
*.doc diff=catdoc
*.ppt diff=catppt
*.xls diff=xls2csv
# handle newer zipped ms document formats
# note: pptx and xlsx did not work for me
*.docx diff=pandoc
*.pptx diff=pandoc
*.xlsx diff=pandoc
- Create the .gitconfig definitions (either in the user-scoped ~/.gitconfig or in the project-scoped ${projectDir}/.git/config). Much of this is based on this article but altered based on my own testing.
[core]
autocrlf = false
[diff]
guitool = kdiff3
[diff "odp2txt"]
textconv = odp2txt
binary = true
[diff "odt2txt"]
textconv = odt2txt
binary = true
[diff "ods2txt"]
textconv = ods2txt
binary = true
[diff "catdoc"]
textconv = catdoc
binary = true
# note catppt did not work for me
[diff "catppt"]
textconv = catppt
binary = true
[diff "xls2csv"]
textconv = xls2csv
binary = true
[diff "xlsx2csv"]
textconv = xlsx2csv
binary = true
[diff "pandoc"]
textconv = pandoc --to=markdown
prompt = false
[diff "pdf2txt"]
textconv = pdf2txt
binary = true
[diff "utf16"]
textconv = iconv -c -f UTF-16LE -t ASCII
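As a quick sanity check on the two files above, git itself can be asked which diff driver it would pick for a given file name. The following is only a sketch under some assumptions: it works in a throwaway repository (so it uses the project-scoped .git/info/attributes and .git/config rather than the user-scoped files), it registers just two of the drivers, and report.docx is a made-up file name that does not need to exist.

```shell
# Throwaway repository so that no user-scoped settings are touched.
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Project-scoped equivalents of two attribute lines from above.
printf '*.docx diff=pandoc\n*.doc diff=catdoc\n' > .git/info/attributes

# Project-scoped equivalents of the matching [diff "..."] sections.
git config diff.pandoc.textconv 'pandoc --to=markdown'
git config diff.catdoc.textconv catdoc

# Ask git which driver it would use for a (hypothetical) file name.
attr_out="$(git check-attr diff -- report.docx)"
echo "$attr_out"

# Ask git which command is registered for that driver.
driver_cmd="$(git config --get diff.pandoc.textconv)"
echo "$driver_cmd"
```

If the first line reports "unspecified" instead of a driver name, the attributes file is not being picked up; if the second prints nothing, the driver section is missing from the config.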
I was never able to get diffs working for xlsx, ppt, or pptx, even after downloading the latest version of pandoc from its GitHub page. The docx conversion worked fine even with the very old version in the Mint/Ubuntu/Debian repos (v1.19.2.4 from 2016). For the xlsx/pptx samples I was using, I always got either "Invalid UTF-8 stream fatal" (old version) or "UTF-8 decoding error" (new version).
This could have been due to the sample files I was using (some samples from the web and some I created by converting LibreOffice documents), my system setup, the versions I was using, or something else.
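To narrow down failures like these, it can help to run a converter by hand, outside git. git calls the textconv command with the file name appended as the last argument and reads the converted text from stdout, so a rough sketch like the one below imitates that call. Note that try_conv is a hypothetical helper name of my own, not something shipped with git or with any of the packages above.

```shell
# try_conv SAMPLE COMMAND [ARGS...]
# Runs COMMAND ARGS SAMPLE the way a textconv driver would be invoked
# (file name appended last) and reports whether the command succeeded.
try_conv() {
  sample="$1"
  shift
  if "$@" "$sample" >/dev/null 2>&1; then
    echo "ok: $* $sample"
  else
    echo "FAILED: $* $sample"
  fi
}
```

For example, try_conv some-sample.xlsx pandoc --to=markdown (with a sample file of your own) shows whether pandoc itself fails on the file before git is even involved. Running pandoc --list-input-formats should also show whether the installed version claims to read a given format at all.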
For completeness, after installing the newer pandoc, I was using:
$ uname -vipor
5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 GNU/Linux
$ dpkg -l catdoc odt2txt pandoc git xlsx2csv|grep '^ii'
ii catdoc 1:0.95-4.1 amd64 text extractor for MS-Office files
ii git 1:2.17.1-1ubuntu0.5 amd64 fast, scalable, distributed revision control system
ii odt2txt 0.5-1build2 amd64 simple converter from OpenDocument Text to plain text
ii pandoc 2.9.2-1 amd64 general markup converter
ii xlsx2csv 0.20+20161027+git5785081-1 all convert xslx files to csv format
EDIT: I also tried using the xlsx2csv package for xlsx conversion instead of pandoc and had issues with that as well. It could be something to do with my samples, but since I am not doing anything special to create them, I would consider that a coverage gap / limitation of xlsx2csv/pandoc if so.
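For anyone who wants to repeat the xlsx2csv experiment: the *.xlsx attribute has to point at the xlsx2csv driver rather than at pandoc. When several attribute lines match the same path, the later line wins, so appending one line is enough. This is a minimal sketch in a throwaway repository; book.xlsx is a made-up file name, and nothing here says xlsx2csv will handle your files any better than it handled mine.

```shell
# Throwaway repository so that no user-scoped settings are touched.
xlsx_repo="$(mktemp -d)"
git init -q "$xlsx_repo"
cd "$xlsx_repo"

# Start from the attribute line used above ...
printf '*.xlsx diff=pandoc\n' > .git/info/attributes
# ... then append an override; for a matching path the later line wins.
printf '*.xlsx diff=xlsx2csv\n' >> .git/info/attributes

# Register the driver, mirroring the [diff "xlsx2csv"] section above.
git config diff.xlsx2csv.textconv xlsx2csv

# Confirm which driver git would now pick for a (hypothetical) file name.
xlsx_driver="$(git check-attr diff -- book.xlsx)"
echo "$xlsx_driver"
```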