Tabular Data Formats

I’ve received a few queries recently that really boil down to a misunderstanding of the differences between tabular data formats and spreadsheets, so I thought I’d write a quick guide around the subject.

Delineated Text

These are more commonly known as “comma-separated values” (*.csv) or “tab delineated values” (*.tab), but are simply text files that describe a table. Usually, each line forms a row, and the value in each column is separated by some symbol (delineation), like a comma (,), semicolon (;) or a tab. Unfortunately, there is no standard definition of delineated text tables, but fortunately the parameters can usually be worked out. The key things to work out are:

  • delineation between values (comma, semicolon, tab etc.)
  • presence or absence of headers in the first row
  • whether quotation marks (“”) are used to surround each field, or only where text is present
  • what type of decimal point is used (usually a point (.) but occasionally a comma (,)
  • the encoding of the text file (e.g. UTF-8)

This is my preferred format – it is minimal, platform-independent and supported by all major databases, programming languages, statistics packages and spreadsheet applications. See below for an example of delineated tabular data that uses commas to separate values, quotation marks around text, decimal points, and has a header row. In R you can create your own example using write.csv(head(iris), file = "iris.csv").

"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
"5",5,3.6,1.4,0.2,"setosa"
"6",5.4,3.9,1.7,0.4,"setosa"

Spreadsheets

The two Excel spreadsheet data formats are *.xls and *.xlsx. These formats are technically very different, but functionally serve a similar purpose. Not only are tabular data stored (often multiple sheets), but formatting, formulae, and other metadata. For the purposes of data import and export, we are usually only interested in the cell values, so all of this extra data only serves to complicate the process. Spreadsheets are fundamentally different from tabulated data.

The Microsoft Excel Spreadsheet (*.xls) is a proprietary binary data format. Many other software packages have implemented this format based on the documentation released, but Excel now support the Microsoft Excel Open XML Format (*.xlsx).

This format is based on extensible markup language (*.xml), a well-defined way of creating structured documents that are both machine and human-readable. A number of these files sit within a folder structure that defines different spreadsheets etc., and the entire collection is then compressed (zipped) for efficiency and convenience. See below for an example of the code in an Excel Open XML sheet (created in R using the openxlsx library, e.g. write.xlsx(head(iris), file = "iris.xlsx").

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac"> <dimension ref="A1"/> <sheetViews><sheetView workbookViewId="0" zoomScale="100" showGridLines="1" tabSelected="1"/></sheetViews> <sheetFormatPr defaultRowHeight="15.0"/><sheetData><row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c><c r="C1" t="s"><v>2</v></c><c r="D1" t="s"><v>3</v></c><c r="E1" t="s"><v>4</v></c></row><row r="2"><c r="A2" t="n"><v>5.1</v></c><c r="B2" t="n"><v>3.5</v></c><c r="C2" t="n"><v>1.4</v></c><c r="D2" t="n"><v>0.2</v></c><c r="E2" t="s"><v>5</v></c></row><row r="3"><c r="A3" t="n"><v>4.9</v></c><c r="B3" t="n"><v>3</v></c><c r="C3" t="n"><v>1.4</v></c><c r="D3" t="n"><v>0.2</v></c><c r="E3" t="s"><v>5</v></c></row><row r="4"><c r="A4" t="n"><v>4.7</v></c><c r="B4" t="n"><v>3.2</v></c><c r="C4" t="n"><v>1.3</v></c><c r="D4" t="n"><v>0.2</v></c><c r="E4" t="s"><v>5</v></c></row><row r="5"><c r="A5" t="n"><v>4.6</v></c><c r="B5" t="n"><v>3.1</v></c><c r="C5" t="n"><v>1.5</v></c><c r="D5" t="n"><v>0.2</v></c><c r="E5" t="s"><v>5</v></c></row><row r="6"><c r="A6" t="n"><v>5</v></c><c r="B6" t="n"><v>3.6</v></c><c r="C6" t="n"><v>1.4</v></c><c r="D6" t="n"><v>0.2</v></c><c r="E6" t="s"><v>5</v></c></row><row r="7"><c r="A7" t="n"><v>5.4</v></c><c r="B7" t="n"><v>3.9</v></c><c r="C7" t="n"><v>1.7</v></c><c r="D7" t="n"><v>0.4</v></c><c r="E7" t="s"><v>5</v></c></row></sheetData><pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/><pageSetup paperSize="9" orientation="portrait" horizontalDpi="300" verticalDpi="300" r:id="rId2"/></worksheet>

Work with Delineated Text

It is clear from the examples given that for tabulated data, delineated text is preferable for both compatibility and efficiency. For more complex data (e.g. multivariate time series, or geo-referenced multivariate data), Network Common Dataform (NetCDF) is often preferable. Only where work needs to be done within Excel, Numbers or Sheets, and contains formulae or formatting, should spreadsheet formats be used.