Git-diffing Excel files
Posted on Fri 04 March 2016 in Tools
In this post, I’m going to demonstrate how to get a useful diff for binary Excel files that reside in a Git repository. I assume that you’re familiar with Git, and in that case you know that having binary files in a repository is generally not a good idea. Nonetheless, sometimes it’s useful—perhaps they contain test data, or import data, or base data for bootstrapping, or some other useful data that aren’t suitable for text form.
Diffing a binary file
For whatever reason you have an Excel file in your repository, finding a difference in
the file between two commits is by default not possible. Let’s say that you have
uncommitted changes to an Excel file called document.xlsx
. This is what happens if
you try to see what has changed:
$ git diff document.xlsx
diff --git a/document.xlsx b/document.xlsx
index b22099f..ebebbb9 100644
Binary files a/document.xlsx and b/document.xlsx differ
Since the file is binary rather than text, Git gives up and just prints that there is a difference. But that will soon change! :-)
Defining an Excel diff driver
First, download the ExcelCompare tool. Click
on Releases and get the latest one—I used
0.5.1 when testing. Unpack the ZIP file in a folder of your choosing, but make sure it’s in a
persistent location (i.e., unpacking it in a Downloads folder is probably a bad idea). I unpacked to
c:\apps\excelcompare
.
By now, it’s probably a good time to mention that I use Git Bash on Windows. That affects the path handling a bit, as you’ll see.
We will write a script to perform the actual diffing. To begin, edit the .git/config
file in
your repository and amend the following:
[diff "excel"]
command = c:/apps/excelcompare/exceldiff.cmd
This registers a diff driver in your repository, telling Git which command to use when diffing
files of excel type. The exceldiff.cmd
script doesn’t
exist yet, but it will in a minute. If you want to place the script somewhere else, or name it
differently, feel free to do so.
Do you see the path oddity? The command path looks almost like a regular Windows path, except it has forward slashes instead of backward slashes. Using backward slashes, even if they are properly escaped, did not work for me.
Next, we must tell Git to associate certain files with the excel diff driver. To do so, edit
.gitattributes
in the root of your repository and amend the following:
*.xlsx diff=excel
Feel free to add more file patterns (one per line) if you wish. ExcelCompare supports a number of different Excel file types.
Note that .gitattributes
(unlike .gitignore
) must be committed
to have any effect, so add it to a commit at this point.
Finally, open up an editor and create exceldiff.cmd
in the correct location (as specified in .git/config
):
@echo off
c:\apps\excelcompare\excel_cmp.bat %2 %5
(Adjust the path to excel_cmp.bat
so that it matches the location where you unpacked ExcelCompare.)
Why arguments two and five? It’s because Git passes seven arguments to a diff command, of which the second is the path to the old file (which likely is a temporary file) and the fifth is the path to the new file. For more detailed information, have a look at the Git documentation, specifically under GIT_EXTERNAL_DIFF.
Back to diffing
Once the script has been saved, we can attempt the diff again:
$ git diff document.xlsx
DIFF Cell at Sheet1!A3 => 'X' v/s 'Y'
----------------- DIFF -------------------
Sheets: [Sheet1]
Rows: [3]
Cols: [A]
----------------- EXTRA WB1 -------------------
Sheets: []
Rows: []
Cols: []
----------------- EXTRA WB2 -------------------
Sheets: []
Rows: []
Cols: []
-----------------------------------------
Excel files C:\...\AppData\Local\Temp\TRAkga_document.xlsx and document.xlsx differ
This is the raw output from ExcelCompare. As can be seen, the uncommitted change is apparently that I changed the value of cell A3 on sheet Sheet1 from X to Y (which happens to be exactly what I did).
And that’s it! Much, much more useful than before!
A small note
As can be seen from the example above, the diff output doesn’t have the same format as a “native” textual Git diff, which may look like this:
$ git diff foo
diff --git a/foo b/foo
index 9972d7e..8320b47 100644
--- a/foo
+++ b/foo
@@ -1,2 +1,2 @@
-bar
+baz
Specifically, the initial metadata lines are not present, and the differences are
not in patch format. This means that an Excel diff cannot be used for patching an Excel
file. If you attempt to use git format-patch
, you’ll see that it just ignores the Excel diff.
Summary
In this post, I have shown how to configure a Git repository to use a custom command for showing the difference between different revisions of an Excel file. If you found it useful, please let me know in the comments!
Update 2016-04-17
- Fixed some typos.
- Fixed incorrect
.git/config
entry. - Added a paragraph about committing the
.gitattributes
files.