15 February 2014
Why The Mercurial ZipDoc Extension Fails For Excel Files
I tend to work on small projects with small file-based databases (usually SQL Server Compact Edition). I find it very convenient to use Excel files to create the seed data for these small databases. I even have a little library that uses a combination of EPPlus, Entity Framework Code First, and good ol' ADO.NET to basically convert any properly formatted Excel file into a SQL Server Compact database.
Given that these Excel files are part of a project and change over time, I like to keep them in source control (in my case, Mercurial). Unfortunately, given that Excel files are not simple text files, it's not easy for Mercurial to determine what has changed from version to version. This makes it basically impossible to visualize what has changed, and over time, it can bloat the repository because the deltas are not stored efficiently. I've always wondered if there was some way to do this better.
After a little Googling, I came across the Mercurial ZipDoc extension and downloaded it. After a little more Googling, I realized I didn't need to download it, since it was already included with TortoiseHg. ZipDoc is supposed to allow for more efficient delta compression when committing Open Office format files (such as docx and xlsx).
After making the appropriate modifications to mercurial.ini file, I performed a small test with a sample Excel 2010 file. My file consisted of one worksheet with the following content:
| A B C D E F G H I J ------------------------------------------------------------------ 1 | 1 1 1 1 1 1 1 1 1 1 2 | 2 2 2 2 2 2 2 2 2 2 ... | 10000 | 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000
In other words, columns A to J of rows 1 to 10,000 were filled sequentially with numbers.
When I saved this file, its size on disk was 318 KB.
My test was very simple:
- Create a repo with just the Excel file (Book1.xlsx) in it.
- Change the value of cell A1 to 10.
- Commit changes.
- Change the value of cell A1 back to 1.
- Commit changes.
After each commit, I noted the size of the .hg/store/data folder.
The (dismal) results
Here are the results:
ZipDoc ZipDoc Commit Message Enabled Disabled ------------------------------------- Initial commit 247 KB 32 KB Change A1 to 10 493 KB 59 KB Revert A1 to 1 738 KB 87 KB
I was expecting ZipDoc to result in a larger initial data file size, which it certainly did, but I was not expecting it to increase so rapidly with only very minor changes to the Excel file. If fact, with ZipDoc enabled, the data file size grew at virtually the same percentage rate as with ZipDoc disabled (but starting at a much higher initial size).
This basically means that ZipDoc was not performing well at all, at least for my particular test case.
For the Excel file in question, the structure of the zipped archive can be revealed by adding ".zip" to the name of the file and extracting the contents.
The result is this:
_rels .rels docProps app.xml core.xml xl _rels workbook.xml.rels theme theme1.xml worksheets sheet1.xml styles.xml workbook.xml [Content_Types].xml
I used Mercurial to analyze what files actually changed between two different versions of the Excel file. The answer was just xl/worksheets/sheet1.xml.
Opening this file, the reason for the dismal performance of ZipDoc became instantly obvious: the entire file was contained on one line (it was essentially compressed or "minified", containing no unnecessary white space). Since Mercurial compares files based on which lines have changed, it will always conclude that the entire file has changed. Hence, ZipDoc will never be able to do its job properly under these conditions.
When I actually formatted (prettified) sheet1.xml using Visual Studio's XML editor before committing it, Mercurial was able to find the one small difference between the two versions of the file, and it was able to store the delta very efficiently.
As long as Excel saves the component XML files as one-liners, ZipDoc won't be of any use, and will actually do more harm than good.
I'm not sure whether ZipDoc is being actively developed (the last update was in 2011), but in the hopes that it is, I posted an issue to the project's Bitbucket repository.
I'll try to keep this post updated as I learn more.comments powered by Disqus