UCL WIKI

UCL Logo

Blog from October, 2013

The nitty gritty

Excel files, or .xlsx files are a zipped archive of .xml files. It is therefore easy to unzip the file and then parse the individual .xml files to extract information that is required.

That is where the ease ends however, as there are some rather tricky aspects to converting the xml to text in a robust way.

XML schema for .xlsx files

Within the unzipped folder structure, the individual worksheets are stored in .xml files in the xl/worksheets/ folder.

The top of the file contains various pieces of header information, which I'll ignore for now. The interesting bits for the purposes of comparing text outputs from a conversion script are how the values, text and formulas are stored and how to get at these.

New rows are specified by

<row>
<\row>

Within the row header, attributes r="" and spans="" specify the row number and
number of columns in the row respectively.

<row r="6" spans="1:9">
</row>

Populated cells result in a <c><\c> header which also contains various
attributes. These attributes depend upon the content of the cell - which
defines the cell type.

Cell types attribute t="" include

  • s : string
  • <blank> : normal formula or a value only
  • array : array formula
  • shared : shared formula

Values

The simplest cell definition contains only a <v><\v> header.

<row r="3" spans="1:9">
        <c r="A3">
             <v>9.48</v>
        </c>
</row>

The value of the cell is stored in the <v><\v> header.

Strings

String are stored in a separate XML file called sharedStrings.xml and
referred to by an index which is stored in the <v><\v> header.

<row r="1" spans="1:9">
        <c r="A1" t="s">
            <v>0</v>
        </c>
</row>

The t="s" flag indicates that a lookup is required.

Formulas

There are three types of ways for storing formulas.

Stored directly

In this case, the text equivalent (minus the = sign) is stored in the
<f><\f> value. The ca="" attribute refers to the location within the
calcChain.xml file of this cell.

<c r="E7">
        <f ca="1">E6+$C$6</f>
        <v>4.45</v>
</c>

The result of the formula calculation is stored in the <v><\v> value.

Shared

Shared formulas are created in Excel when copying using fill command or
equivalent. The formula is stored once in the first cell in which it is
defined and assigned a shared index using the si="" attribute in the
<f><\f> header.

<c r="G8">
        <f t="shared" ref="G8:G10" si="1">F8+G7</f>
        <v>6</v>
</c>

In subsequent cells that share this formula, the formula is referenced
with the same shared index si="".

<c r="G9">
        <f t="shared" si="1"/>
        <v>10</v>
</c>

One issue with this is that Excel must recompute the references
dynamically upon opening, and so any text representation of shared
formulas must also be recomputed before accurate text based differences
could be computed.

Again, the computed result of the formula is stored in the <v><\v>
value.

Array

Array based formulas are stored directly in the value of the <f><\f>
header.

<c r="I9">
        <f t="array" ref="I9">H9</f>
        <v>14</v>
</c>

The computed result of the formula is stored in the <v><\v> value.

git is a distributed version control software package that enables very powerful management of software development projects. But what if you are developing an energy system model, whose inputs are large numbers of Excel spreadsheets?

Although Git works best with text files, Excel spreadsheets are not beyond its capabilities. The standard setup of git means that *.xlsx files are viewed as binary files.  Actually, xlsx files are a zipped archive of XML files, and so can be unzipped and compared relatively easily with the help of a script to parse the XML.

TIMES models consist of a folder containing a nested file structure of Excel spreadsheets. On model compilation, the spreadsheets are read using the VEDA front-end (a proprietary software package) into a locked Access database. But to perform effective version control of a TIMES model, it is necessary only to record the changes to the Excel spreadsheets themselves.

Below, I'll outline the best practice for working with git and Excel spreadsheet files.

But first, here some examples of what is possible. Here I'm using the TIMES_DEMO model. This TIMES model contains four main spreadsheets, one for each of four key sectors.

Here's a picture of one sheet in the VT_DEMOT_ELC_V5.xlsx workbook entitled COM_Attribs.

The python script produces text output of the worksheet in the following format:

=================================
Sheet: COM_Attribs[ 13 , 6 ]
=================================
    A1: * values associate with commodities
    C3: ~COMEMI
    C4: CommName
    D4: OILDST
    E4: GASNAT
    F4: COAHAR
    C5: ELCCO2N
    D5: 70.0
    E5: 49.0
    F5: 94.0
    C7: <more emittants here>
    C9: ~COMAGG
    C10: CommName
    D10: ELCCO2N
    E10: CH4
    C11: GHG
    D11: 0.001
    E11: 0.21
    F11: <more commodities>
    C13: <more commodities>

Okay, so I'll now make a change and enter the value 72 in cell D5, replacing the existing value of 70. Then I saved the spreadsheet.

Then I type git diff at a command prompt. I obtain the following report.
Sure enough, the change has been picked up. The - sign indicates the removal of a value, and the {+} the addition of a line.
Usefully, the heading in the text file, which corresponds to the sheet name (COM_Attribs in this case) is retained, so that it is easy to identify in which sheet of the workbook a change has been made.

$ git diff
diff --git a/VT_DEMOT_ELC_V5.xlsx b/VT_DEMOT_ELC_V5.xlsx
index 77ee5f0..8982118 100644
--- a/VT_DEMOT_ELC_V5.xlsx
+++ b/VT_DEMOT_ELC_V5.xlsx
@@ -356,7 +356,7 @@ Sheet: COM_Attribs[ 13 , 6 ]
     E4: GASNAT
     F4: COAHAR
     C5: ELCCO2N
-    D5: 70.0
+    D5: 72.0
     E5: 49.0
     F5: 94.0
     C7: <more emittants here>

Installation and setup

As TIMES works only on Windows systems, other operating system specific installs will not be covered, but the python script linked to later and some of the git setup routines are applicable to those using Unix and OS X systems too. Feel free to adapt these guidelines to your systems.

  1. If you're happy working at a command prompt (i.e. typing in text commands) download and install git for Windows. Otherwise, download and install github client for Windows and setup an account. The github website has a range of very useful help files to follow on how to setup git on your computer. These include entering a username and e-mail address and setting up password caching for signing in to the website.
  2. Next, we need to install Python. There are various distributions of Python out there, but for ease, I'd recommend using Enthought Canopy - which is a free (for academics) distribution of the Python programming language and a whole bunch of useful packages. Request an academic licence and then download and install.
  3. Now, download the Python script from my github repository online here and save the file as git_diff_xlsx.py in a memorable location on your drive.
  4. Add the following lines to the global .gitconfig file - this is normally in your user folder e.g. C:/Users/will/:
    [diff "zip"]
    binary = True
    textconv = python c:/path/to/git_diff_xlsx.py
    Note that in the final line above - your need to include the path to the git_diff_xlsx.py file you downloaded in the previous step.
  5. Add the following line to the repository's .gitattributes
    *.xlsx diff=zip

That's setup over and ready to go. Now, when you navigate to a git repository and type the git diff command, any changes to xlsx spreadsheet files will be recorded as human-readable differences.