HDF5 + XML = SDCubes

A paper on data organization just came out in Nature Methods (Millard et al. 2011, commentary by Swedlow et al.). The authors argue, and I agree, that using an XML schema to describe data is a good way to simplify automated analysis. For the primary data container, they use HDF5. Fun fact for all the MATLAB users in the audience (from Tim at Imperial College; as far as I know it applies only to the version 7.3 .mat format, since older .mat versions are not HDF5-based):

The .mat file format is simply an HDF5 file with a pointless header prepended.
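If you want to check that claim on a file of your own, here is a minimal sketch using only the Python standard library. It looks for the 8-byte HDF5 signature at the offsets where the HDF5 format allows a superblock to start (0, 512, 1024, 2048, and so on). The function names and the stand-in demo file are my own inventions for illustration; as far as I know, a version 7.3 .mat file keeps its header in the first 512 bytes, so the signature should turn up at offset 512.

```python
import os
import tempfile

# The 8-byte signature that marks the start of an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_hdf5_offset(path, max_offset=1 << 20):
    """Return the byte offset of the HDF5 signature, or None if not found.

    Only the offsets permitted by the HDF5 format are checked:
    0, 512, 1024, 2048, ... (doubling each time) up to max_offset.
    """
    offset = 0
    with open(path, "rb") as f:
        while offset <= max_offset:
            f.seek(offset)
            if f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return offset
            offset = 512 if offset == 0 else offset * 2
    return None


def write_demo_file(path):
    """Write a stand-in file (NOT a real .mat file): a 512-byte text
    header followed by the HDF5 signature."""
    with open(path, "wb") as f:
        f.write(b"demo header, hypothetical contents".ljust(512, b" "))
        f.write(HDF5_SIGNATURE)


# Demo on the stand-in file; point find_hdf5_offset at a real file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mat")
    write_demo_file(demo_path)
    print(find_hdf5_offset(demo_path))  # prints 512
```

A result of 0 would mean a plain HDF5 file, and None would mean the file is probably not HDF5 at all (an older .mat file, for example).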

They call the XML metadata + HDF5 data combos “SDCubes”, for Semantically typed Data hyperCubes. Why cubes? The name suggests that the data are the same size along every axis, which they probably aren’t. If you don’t like “hyperrectangle”, you can use “orthotope”. One point that is lost in the figure I put above is that the axes do not have to be continuous. There can be gaps and jumps. There can also be piles of data that all share one point on an axis, if that suits the data.
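To make the gaps-and-piles idea concrete, here is a rough sketch of an XML descriptor that indexes datasets by their position along named axes. The element names, attribute names, and dataset paths are all hypothetical; this is not the schema from the paper, just my guess at the general shape. The axes are listed as discrete points (so jumps are fine), one coordinate holds two datasets (a pile), and most coordinate combinations hold nothing (gaps).

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor: every name here is made up for illustration.
DESCRIPTOR = """\
<cube name="demo">
  <axis name="time_min">
    <point value="0"/>
    <point value="5"/>
    <point value="60"/>
  </axis>
  <axis name="well">
    <point value="A1"/>
    <point value="B7"/>
  </axis>
  <entry time_min="0" well="A1" dataset="/images/t000_A1"/>
  <entry time_min="60" well="A1" dataset="/images/t060_A1"/>
  <entry time_min="60" well="A1" dataset="/features/t060_A1"/>
  <entry time_min="5" well="B7" dataset="/images/t005_B7"/>
</cube>
"""


def load_index(xml_text):
    """Parse the descriptor into (axis names, {coordinates: [dataset paths]})."""
    root = ET.fromstring(xml_text)
    axes = [axis.get("name") for axis in root.findall("axis")]
    index = {}
    for entry in root.findall("entry"):
        # The coordinate key follows the order in which the axes are declared.
        key = tuple(entry.get(name) for name in axes)
        index.setdefault(key, []).append(entry.get("dataset"))
    return axes, index


def lookup(index, *coords):
    """Return every dataset path stored at a coordinate (empty list for a gap)."""
    return index.get(tuple(coords), [])


axes, index = load_index(DESCRIPTOR)
print(axes)                       # axis names, in declaration order
print(lookup(index, "60", "A1"))  # a pile: two datasets at one coordinate
print(lookup(index, "5", "A1"))   # a gap: nothing stored here
```

In a real setup the dataset attribute would point at a path inside the HDF5 file, and the XML would only need to describe where things are and what they mean.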

I like this approach because it is very general and simple. It consists of two file formats that are already being used by many researchers. In a way, the authors didn’t “create” anything. Hopefully this paper will give the strategy some added credibility and help to standardize it. Then people can concentrate on developing tools for working with data in this system, rather than developing new formats all the time.