Tabular Data Formats

I’ve received a few queries recently that really boil down to a misunderstanding of the differences between tabular data formats and spreadsheets, so I thought I’d write a quick guide around the subject.

Delimited Text

These are more commonly known as “comma-separated values” (*.csv) or “tab-separated values” (*.tsv, *.tab), but they are simply text files that describe a table. Usually, each line forms a row, and the values in each column are separated by some symbol (the delimiter), such as a comma (,), a semicolon (;) or a tab. Unfortunately, there is no standard definition of delimited text tables, but fortunately the parameters can usually be worked out. The key things to work out are:

  • the delimiter between values (comma, semicolon, tab, etc.)
  • presence or absence of headers in the first row
  • whether quotation marks (“”) are used to surround every field, or only fields containing text
  • what type of decimal separator is used (usually a point (.), but occasionally a comma (,))
  • the encoding of the text file (e.g. UTF-8)

This is my preferred format – it is minimal, platform-independent and supported by all major databases, programming languages, statistics packages and spreadsheet applications. See below for an example of delimited tabular data that uses commas to separate values, quotation marks around text, decimal points, and a header row. In R you can create your own example using write.csv(head(iris), file = "iris.csv").

"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
"5",5,3.6,1.4,0.2,"setosa"
"6",5.4,3.9,1.7,0.4,"setosa"

Spreadsheets

The two Excel spreadsheet formats are *.xls and *.xlsx. These formats are technically very different, but functionally they serve a similar purpose. Not only is tabular data stored (often across multiple sheets), but so are formatting, formulae, and other metadata. For the purposes of data import and export, we are usually only interested in the cell values, so all of this extra data only serves to complicate the process. Spreadsheets are therefore fundamentally different from tabulated data.

The Microsoft Excel Spreadsheet (*.xls) is a proprietary binary data format. Many other software packages have implemented this format based on the documentation released, but Excel now supports the Microsoft Excel Open XML Format (*.xlsx).

This format is based on Extensible Markup Language (XML), a well-defined way of creating structured documents that are both machine- and human-readable. A number of these files sit within a folder structure that defines the different sheets and so on, and the entire collection is then compressed (zipped) for efficiency and convenience. See below for an example of the code in an Excel Open XML sheet (created in R using the openxlsx library, e.g. write.xlsx(head(iris), file = "iris.xlsx")).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
           xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"
           xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"
           xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
           mc:Ignorable="x14ac"
           xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
  <dimension ref="A1"/>
  <sheetViews><sheetView workbookViewId="0" zoomScale="100" showGridLines="1" tabSelected="1"/></sheetViews>
  <sheetFormatPr defaultRowHeight="15.0"/>
  <sheetData>
    <row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c><c r="C1" t="s"><v>2</v></c><c r="D1" t="s"><v>3</v></c><c r="E1" t="s"><v>4</v></c></row>
    <row r="2"><c r="A2" t="n"><v>5.1</v></c><c r="B2" t="n"><v>3.5</v></c><c r="C2" t="n"><v>1.4</v></c><c r="D2" t="n"><v>0.2</v></c><c r="E2" t="s"><v>5</v></c></row>
    <row r="3"><c r="A3" t="n"><v>4.9</v></c><c r="B3" t="n"><v>3</v></c><c r="C3" t="n"><v>1.4</v></c><c r="D3" t="n"><v>0.2</v></c><c r="E3" t="s"><v>5</v></c></row>
    <row r="4"><c r="A4" t="n"><v>4.7</v></c><c r="B4" t="n"><v>3.2</v></c><c r="C4" t="n"><v>1.3</v></c><c r="D4" t="n"><v>0.2</v></c><c r="E4" t="s"><v>5</v></c></row>
    <row r="5"><c r="A5" t="n"><v>4.6</v></c><c r="B5" t="n"><v>3.1</v></c><c r="C5" t="n"><v>1.5</v></c><c r="D5" t="n"><v>0.2</v></c><c r="E5" t="s"><v>5</v></c></row>
    <row r="6"><c r="A6" t="n"><v>5</v></c><c r="B6" t="n"><v>3.6</v></c><c r="C6" t="n"><v>1.4</v></c><c r="D6" t="n"><v>0.2</v></c><c r="E6" t="s"><v>5</v></c></row>
    <row r="7"><c r="A7" t="n"><v>5.4</v></c><c r="B7" t="n"><v>3.9</v></c><c r="C7" t="n"><v>1.7</v></c><c r="D7" t="n"><v>0.4</v></c><c r="E7" t="s"><v>5</v></c></row>
  </sheetData>
  <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
  <pageSetup paperSize="9" orientation="portrait" horizontalDpi="300" verticalDpi="300" r:id="rId2"/>
</worksheet>
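Because an *.xlsx file is just a zip archive, you can inspect this structure yourself from R; listing the archive contents shows the folder of XML parts (assuming iris.xlsx was created by the write.xlsx() call above):

# an *.xlsx file is a zip archive - list the XML parts inside it
unzip("iris.xlsx", list = TRUE)
# the worksheet shown above lives at xl/worksheets/sheet1.xml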

Work with Delimited Text

It is clear from the examples above that, for tabulated data, delimited text is preferable for both compatibility and efficiency. For more complex data (e.g. multivariate time series, or geo-referenced multivariate data), Network Common Data Form (NetCDF) is often preferable. Spreadsheet formats should only be used where the work needs to be done within Excel, Numbers or Sheets and involves formulae or formatting.

New Amateur Radio Callsign

I passed my final amateur radio exam, so I now have a full licence and the callsign M0TKG (I’ve surrendered my old callsigns 2E0TKG and M6TKG). I’ll be running a little portable low-power (QRP) HF station and will also be active on the 6 m and 2 m (VHF) and 70 cm (UHF) bands in south Manchester. I’m also thinking of taking my Morse exam, but we’ll see!

Soil Texture App

Need to draw a neat-looking soil texture diagram? The soiltexture package in R is really powerful. For those in a hurry for coursework, I’ve made a little web app available here, but for those doing more advanced work, I’d recommend having a look at Julien Moeys’ guide to the package here.
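If you’d rather script it yourself, here’s a minimal sketch (the sample values are invented purely for illustration):

library(soiltexture)

# three made-up samples; the columns must be named CLAY, SILT and SAND,
# and each row must sum to 100
samples <- data.frame(CLAY = c(20, 40, 10),
                      SILT = c(30, 30, 60),
                      SAND = c(50, 30, 30))

# plot the samples on the USDA texture triangle
TT.plot(class.sys = "USDA.TT", tri.data = samples)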

Measuring Objects In Photomicrographs Using ImageJ

Many of our lab users have been trying to use various Zeiss software packages to make measurements from photomicrographs. It is often easier and quicker to use ImageJ to make these measurements.

Screenshot of measurement in progress. Here, each microbead is selected using the oval selection tool. The image scale has been calibrated and a scale bar added. The mean diameter is 263 ± 78 μm (thanks to Jiawei Li, UoM, for providing this image).

Here’s how:

Image Acquisition

  1. Use the Zeiss Zen software to acquire an image of a graticule or rule under the magnification settings you will be using for your imaging.
  2. Acquire the images you need.
  3. Save the images in the Zen software as JPEG or TIFF files. Remember, keep your filenames and folder structure organised!

Calibrating the Scale

  1. Open the ImageJ software. Open the image of the graticule or rule by selecting File > Open…
  2. Using the line tool (5th button from the left), select two points of known distance. For the best accuracy, choose the two widest-spaced points possible (e.g. if you can see 2 mm of marks in the image, don’t select only a 1 mm span). Use the zoom controls (10th from the left, then the + and – keys) as required.
  3. Click Analyze > Set Scale… and type the known distance (e.g. 2) in the second box down. Also, write the units (e.g. mm) in the appropriate box. Check the “Global” box to apply this calibration to all images (the arithmetic behind this is sketched after this list).
  4. If you have changed the magnification, you can change the scale manually by adjusting the Known distance or Distance in pixels, but it is usually better to perform a new calibration for each magnification.
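Set Scale… simply stores the ratio between pixels and real units; here’s a quick sketch of that arithmetic in R (the pixel counts and distances are invented for illustration):

# hypothetical calibration: 2 mm of graticule spans 1450 pixels
px_per_mm <- 1450 / 2

# convert a measured length of 380 pixels into micrometres
380 / px_per_mm * 1000   # ~524 um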

Making Measurements

  1. Open the image file you wish to analyse using File > Open… or File > Open Next.
  2. Select Analyze > Tools > ROI Manager.
  3. Check the “Show All” and “Labels” boxes in the ROI Manager dialogue.
  4. Using the straight line tool, select the edges of the object you want to measure. Press the [T] key or click the Add button in the ROI Manager.
  5. Repeat until you have selected all of the objects you wish to measure.

If you are measuring circular or oval objects, you may have better results using the oval selection tool (second from the left). You may wish to experiment with different selection tools depending on your measurement.

Exporting Data

  1. When you have selected all of the objects you wish to measure, select “Measure” in the ROI Manager.
  2. You may wish to obtain summary statistics – in this case, select Results > Summarize.
  3. The Results dialogue will open. Save the data by selecting File > Save As…
  4. If you name your file with a *.csv extension, you will be able to open it directly in statistics software of your choice.
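From there, the exported results can be pulled straight into R, for example (the file name and column name here are assumptions – yours will depend on what you saved and which measurements were enabled):

# read the exported ROI measurements (file name assumed)
results <- read.csv("Results.csv")

# summarise the measured lengths (column name depends on your settings)
summary(results$Length)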

Automated Analysis

You don’t have to select everything manually. The particles can be analysed automatically. Firstly, the colour image needs to be converted into binary. Then, individual particles need to be separated where they overlap, and finally, the automatic particle recognition can be run and summary statistics obtained.

  1. Open the image file and ensure the scale is set correctly.
  2. Open Image > Adjust > Color Threshold… and adjust the settings so that the particles are correctly picked out from the background. Set the Threshold colour to black & white.
  3. Convert the image to black and white format by selecting Image > Type > 8-bit.
  4. If the particles have “holes” in, select Process > Binary > Fill Holes. You should see the particles fill.
  5. To sort out overlapping particles, select Process > Binary > Watershed. You should see small white lines between all the particles.
  6. Select Analyze > Analyze Particles. You may need to adjust the size range to eliminate objects that are not of interest. Note the size is expressed in mm², so for circular objects you’ll need to convert between diameter and area using area = π × (diameter/2)² (see the short sketch after this list). The lower end of the circularity setting can also be adjusted for optimum results. For the example image I’ve used a size range of 0.01-4 mm² and a circularity of 0.6-1.
  7. In the Show menu, select “Ellipses”.
  8. Select “Add to manager”, “Exclude on edges”, and use “Display results”, “Clear results” and “Summarize” as required.
  9. You can manually delete or add ellipses using the ROI Manager.
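As noted in step 6, converting between the diameters you care about and the areas that Analyze Particles expects is a one-liner each way; a quick sketch in R:

# area of a circle from its diameter, and back again
area_from_diameter <- function(d) pi * (d / 2)^2
diameter_from_area <- function(a) 2 * sqrt(a / pi)

area_from_diameter(0.263)   # a 263 um (0.263 mm) bead is ~0.054 mm^2
diameter_from_area(0.01)    # the 0.01 mm^2 lower cut-off is ~0.11 mm across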

Other Tools

There are loads of useful tools in ImageJ, including many that can speed up and improve repetitive workflows, and assist in preparing images for publication. Just pop by the office for a chat if you are interested in a demonstration of some features.

New Amateur Radio Callsign

I’ve just passed my intermediate amateur radio exams with the help of Stockport Radio Society (G8SRS). I’ll be keeping my previous callsign M6TKG for now, but I’ll also now be using 2E0TKG under certain circumstances. As a gift to myself for passing, I’ve started putting together a little portable rig for low-power (QRP) work based on a Yaesu FT-817ND. I’ve already started reading for my full licence test, and have just begun working towards my 12 words-per-minute Morse code test. I hope that soon I’ll be in a position to bring a rig away when I travel, and make some contacts from some of the unusual places I sometimes have the opportunity to visit.

High-resolution proglacial lake records of pre-Little Ice Age glacier advance, northeast Greenland

I’m pleased to announce that a paper I was involved in (led by colleagues from Manchester Metropolitan University and Liverpool John Moores University) has now been published in Boreas. The paper presents a record of glacial activity in Greenland over centennial time-scales using lake-sediment geochemical proxies. My contribution was the multivariate statistical treatment of the elemental data, alongside some modelling work on the chronology. The paper is available online here.

Bicycle Distance Sensor

Introduction

A student recently came to us with a project that would make use of Dr Ian Walker’s bicycle overtaking sensor, so I agreed to make him one. Since Dr Walker published his original article, the code has had to change a bit, so I thought this would be a good opportunity to post the updated code and include a complete bill of materials.

Bill of Materials

Code for Arduino

You’ll need a USB cable to program the Arduino. I have made some modifications to Dr Walker’s original code. They address four points:

  1. The RTC (real-time clock) used in the latest revision of the Datalogging Arduino Shield has changed, and the code has been updated to suit.
  2. There is a new library available for the Maxbotix distance sensors which is far more elegant than the old alternatives.
  3. I’ve decided to use the internal pull-up resistor on the button input, rather than an external pull-down. It saves on the wiring!
  4. I’ve added a serial output for checking the unit is working correctly.

The revised code is available here. Remember that the first time you use the RTC, you need to program it – instructions can be found here.

Building the Unit

It should be pretty obvious how this all goes together from the photographs, code and wiring diagrams, but a few hints:

  • The handlebar button fits in the printed enclosure, and the cable can be secured with a cable tie to stop it pulling on the button connections. A tiny dab of super-glue holds the button in place.
  • The Arduino and shield can be secured to the base plate that comes with all Genuinos using the self-tapping screws, and the base plate secured in the enclosure using the sticky fixers.
  • I secured the Maxbotix sensor to the enclosure by screwing two standoffs to it, and gluing the standoffs to the inside of the enclosure. 
  • I thought it easier to connect the power button using crimp connectors. This allowed me to easily fit the button to the enclosure. The battery holders are held down with sticky fixers. 
The Fritzing file for this diagram is available here.
The unit, showing the inside of the enclosure and the handlebar-mounted control unit (bottom right).

Conclusions

I’ll update this section in due course when we have the results of the research conducted with this sensor!

Rescaling for Compression

Sometimes when coring, cores can become compressed in the chamber. For example, if you know you started a drive at 1 m and finished at 2 m, but only collected 0.95 m of sediment, you may have compression. This is often a problem for Livingstone-type corers, but not an issue with Russian-type corers. Compression can be caused by:

  • stretch in locking lines for corer heads (or ground compression if those lines are locked to the surface),
  • friction of sediment inside the tube,
  • airlocks inside the tube,
  • poor cutting performance of the corer end.

It’s difficult to know whether you’ve lost material or suffered compression, so good practice says the stratigraphy should be cross-correlated with an adjacent sequence. If there is compression it can be corrected for, but in practice this can be a pain. Here’s a simple way of doing it in R. Here I’m correcting the depths of sub-samples taken at 10 mm intervals, in a core that should have been 1 m long (a drive from 1 m to 2 m depth) but was actually 0.95 m long when extracted.

corelength <- 950 # the measured length of the core when extracted (mm)
truestart <- 1000 # the starting depth of the drive (mm)
trueend <- 2000   # the finishing depth of the drive (mm)
intervals <- 10   # the sampling interval (mm)

# create the sampling sequence - this may already be available in your data
# (the variable is called corelength to avoid masking base::length)
z <- seq(from = trueend - corelength, to = trueend, by = intervals)

# stretch those depths linearly onto the full drive
zc <- seq(from = truestart, to = trueend, length.out = length(z))

# compare the data
df <- data.frame(original = z,
                 corrected = round(zc, digits = 0),
                 difference = round(z - zc, digits = 0)
                 )
head(df)
tail(df)

This code assumes that there’s more compression at the top and less at the bottom of the sequence, which is not unreasonable given the possible causes listed above. Sometimes cores can expand due to decompression of the sediment following coring. This is particularly common in gyttja from lakes, where the weight of the water column compresses the sediment in situ, so it expands ex situ. Take an example where we have a core of length 1.05 m, but we know we only cored from 1 m to 2 m; we need to compress the core back to 1 m length. The following code deals with this problem:

corelength <- 1050 # the measured length of the core when extracted (mm)
truestart <- 1000  # the starting depth of the drive (mm)
trueend <- 2000    # the finishing depth of the drive (mm)
intervals <- 10    # the sampling interval (mm)

# create the sampling sequence - this may already be available in your data
z <- seq(from = truestart, to = truestart + corelength, by = intervals)

# shift the depths so they are centred on the drive
zs <- z - (max(z) - trueend) / 2

# compress these depths linearly about the centre of the drive
zc <- (trueend - truestart) / corelength * zs +
  mean(c(trueend, truestart)) -
  ((trueend - truestart) / corelength) * mean(zs)

# compare the data
df <- data.frame(original = z,
                 shifted = zs,
                 corrected = round(zc, digits = 0),
                 difference = round(zs - zc, digits = 0)
                 )
head(df)
tail(df)
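Both corrections are really just linear maps of the measured depth range onto the known drive interval, so if you find yourself doing this often, a small helper function covers compression and expansion alike (a sketch of my own, not from any package):

# linearly rescale measured depths onto the known drive interval;
# works for both compressed and expanded cores
rescale_depths <- function(z, truestart, trueend) {
  (z - min(z)) / (max(z) - min(z)) * (trueend - truestart) + truestart
}

# reproduces the corrected depths from either example above
rescale_depths(seq(1050, 2000, by = 10), truestart = 1000, trueend = 2000)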

Official New Divisions for the Holocene

It’s official! The ICS have announced the subdivisions of the Holocene, so I’ve made a helpful graphic to summarise their announcement, which I’ve reproduced in full below.

A high resolution copy can be found on the resources page.

Formal subdivision of the Holocene Series/Epoch

It has been announced that the proposals for the subdivision of the Holocene Series/Epoch (11 700 years ago to the present day) into three stages/ages and their corresponding subseries/subepochs by the International Subcommission on Quaternary Stratigraphy (ISQS) (a subcommission of the International Commission on Stratigraphy – ICS) have been ratified unanimously by the International Union of Geological Sciences (IUGS). The subdivisions now formally defined are:

1. Greenlandian Stage/Age = Lower/Early Holocene Subseries/Subepoch
Boundary Stratotype (GSSP): NorthGRIP2 ice core, Greenland (coincident with the Holocene Series/Epoch GSSP, ratified 2008). Age: 11,700 yr b2k (before AD 2000).

2. Northgrippian Stage/Age = Middle/Mid-Holocene Subseries/Subepoch
Boundary Stratotype (GSSP): NorthGRIP1 ice core, Greenland. Global Auxiliary Stratotype: Gruta do Padre Cave speleothem, Brazil. Age: 8326 yr b2k.

3. Meghalayan Stage/Age = Upper/Late Holocene Subseries/Subepoch
Boundary Stratotype (GSSP): Mawmluh Cave speleothem, Meghalaya, India. Global Auxiliary Stratotype: Mount Logan ice core, Canada. Age: 4250 yr b2k.

These divisions are now each defined by Global Stratotype Sections and Points (GSSPs), which means that they are fixed in time in sedimentary sequences. The terms Greenlandian Stage/Age, Northgrippian Stage/Age, Meghalayan Stage/Age, Lower/Early Holocene Subseries/Subepoch, Middle/Mid-Holocene Subseries/Subepoch and Late/Upper Holocene Subseries/Subepoch therefore have formal definitions and boundaries.

These definitions represent the first formal geological subdivision of the Holocene Series/Epoch, resulting from over a decade of labour by members of the joint ISQS (International Subcommission on Quaternary Stratigraphy) – INTIMATE Members Working Group (Integration of Ice-core, Marine and Terrestrial Records), led by Professor Mike Walker (University of Aberystwyth).

Phil Gibbard
Secretary General ICS
Cambridge 26.6.18