I'm just learning to use the rhdf5 package. For creating and indexing a matrix without dimnames, the operations are straightforward:
library(rhdf5)
my.mat <- matrix(rnorm(400,2,1), nrow=100, ncol=4)
fl <- tempfile()
h5createFile(fl)
h5write(my.mat, fl, "mat")
h5read(fl, "mat", list(2:3, 3:4))
## [,1] [,2]
## [1,] 0.3199968 1.947390
## [2,] 1.3338179 2.623461
h5read(fl, "mat", list(2:3, NULL))
## [,1] [,2] [,3] [,4]
## [1,] 1.247648 -0.380762 0.3199968 1.947390
## [2,] 3.157954 1.334057 1.3338179 2.623461
The package seems to support some higher-level functionality, e.g., for writing data.frame objects, but I ended up 'rolling my own' functions to create and subset a matrix with dimnames. Here's the write function, which stores the row and column names as HDF5 attributes on the data set:
h5matrix_write <-
    function(obj, file, name, ...)
{
    if (!is.matrix(obj) || is.null(dimnames(obj)) ||
        any(sapply(dimnames(obj), is.null)))
        stop("'obj' must be a matrix with row and column names")
    ## open an existing file, or create a new one
    fid <- if (file.exists(file))
        H5Fopen(file)
    else
        H5Fcreate(file)
    h5createDataset(fid, name, dims=dim(obj))
    did <- H5Dopen(fid, name)
    ## create the attributes on the data set ('did') rather than on the
    ## file, so that they travel with the matrix they describe
    h5createAttribute(did, "rownames", nrow(obj), storage.mode="character",
                      size=max(nchar(rownames(obj))))
    h5createAttribute(did, "colnames", ncol(obj), storage.mode="character",
                      size=max(nchar(colnames(obj))))
    h5writeDataset(obj, fid, name)
    h5writeAttribute(rownames(obj), did, "rownames")
    h5writeAttribute(colnames(obj), did, "colnames")
    H5Dclose(did)
    H5Fclose(fid)
    file
}
For reading in a subset, I check whether each index is a character vector. If so, I match it against the stored names to get integer positions, and use those to extract the relevant values:
h5matrix_select <-
    function(file, name, i, j, ...)
{
    ## FIXME: doesn't handle logical subsetting
    fid <- H5Fopen(file)
    did <- H5Dopen(fid, name)
    ## release the data set and file handles however the function exits
    on.exit({ H5Dclose(did); H5Fclose(fid) })
    aid <- H5Aopen(did, "rownames")
    rownames <- H5Aread(aid)
    H5Aclose(aid)
    if (missing(i))
        i <- seq_along(rownames)
    else if (is.character(i)) {
        ## translate row names to integer positions
        i <- match(i, rownames)
        if (any(is.na(i)))
            stop(sum(is.na(i)), " unknown row names")
    }
    rownames <- rownames[i]
    aid <- H5Aopen(did, "colnames")
    colnames <- H5Aread(aid)
    H5Aclose(aid)
    if (missing(j))
        j <- seq_along(colnames)
    else if (is.character(j)) {
        ## translate column names to integer positions
        j <- match(j, colnames)
        if (any(is.na(j)))
            stop(sum(is.na(j)), " unknown colnames")
    }
    colnames <- colnames[j]
    ## read only the requested block, then restore its dimnames
    value <- h5read(file, name, list(i, j))
    dimnames(value) <- list(rownames, colnames)
    value
}
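One way to address the FIXME would be to normalise every kind of index to integer positions before calling h5read(). The following is only a sketch in base R: resolve_index() is a hypothetical helper of my own naming (not part of rhdf5), and it implements a simplified subset of R's usual subsetting rules (no negative indices, no recycling of logical vectors).

```r
## Hypothetical helper (not part of rhdf5): convert a character, logical,
## or numeric index into integer positions along a dimension whose names
## are 'nms'. 'what' is only used to build error messages.
resolve_index <- function(idx, nms, what = "row")
{
    n <- length(nms)
    if (is.null(idx))                   # NULL selects everything
        return(seq_len(n))
    if (is.character(idx)) {            # names -> positions
        pos <- match(idx, nms)
        if (anyNA(pos))
            stop(sum(is.na(pos)), " unknown ", what, " names")
        return(pos)
    }
    if (is.logical(idx)) {              # TRUE/FALSE mask -> positions
        if (length(idx) != n || anyNA(idx))
            stop("logical ", what, " index must have length ", n,
                 " and contain no NA")
        return(which(idx))
    }
    if (is.numeric(idx)) {              # positions, bounds-checked
        pos <- as.integer(idx)
        if (anyNA(pos) || any(pos < 1L | pos > n))
            stop(what, " index out of bounds")
        return(pos)
    }
    stop("unsupported ", what, " index type")
}

nms <- paste0("rid", 1:5)
resolve_index(c("rid4", "rid2"), nms)                    # positions 4, 2
resolve_index(c(TRUE, FALSE, FALSE, TRUE, FALSE), nms)   # positions 1, 4
resolve_index(2:3, nms)                                  # positions 2, 3
```

Inside h5matrix_select() the two if / else if blocks could then, in principle, be replaced by one call each for i and j, passing NULL when the argument is missing.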
In action:
dimnames(my.mat) <- list(paste0("rid", seq_len(nrow(my.mat))),
paste0("cid", seq_len(ncol(my.mat))))
fl <- h5matrix_write(my.mat, tempfile(), "mat")
h5matrix_select(fl, "mat", 4:5, 2:3)
## cid2 cid3
## rid4 0.4716097 2.3490782
## rid5 2.0896238 0.5141749
h5matrix_select(fl, "mat", 4:5)
## cid1 cid2 cid3 cid4
## rid4 2.0947833 0.4716097 2.3490782 3.139687
## rid5 0.8258651 2.0896238 0.5141749 2.509301
set.seed(123)
h5matrix_select(fl, "mat", sample(rownames(my.mat), 3), 2:3)
## cid2 cid3
## rid29 0.6694079 3.795752
## rid79 2.1635644 2.892343
## rid41 3.7779177 1.685139
(h5read(fl, "mat", read.attributes=TRUE) reads everything in, rather than just a subset.) I think the simpler approach from @jimmyb (storing the row names as a separate variable) would also work with rhdf5.
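For comparison, here is a rough sketch of how the separate-variable approach might look with rhdf5. The data set names "mat_rownames" and "mat_colnames" are my own illustrative choice, not a convention of the package or of @jimmyb's answer, and the small matrix m is made up for the example.

```r
library(rhdf5)

## Assumption: the dimnames live in sibling data sets called
## "mat_rownames" and "mat_colnames"; the naming scheme is illustrative.
m <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("rid", 1:3), paste0("cid", 1:4)))
f <- tempfile(fileext = ".h5")
h5createFile(f)
h5write(unname(m), f, "mat")              # the values, without dimnames
h5write(rownames(m), f, "mat_rownames")   # names as ordinary data sets
h5write(colnames(m), f, "mat_colnames")

## Read the (small) name vectors, translate names to positions, then
## read only the requested block of the matrix.
rn <- as.character(h5read(f, "mat_rownames"))
cn <- as.character(h5read(f, "mat_colnames"))
i <- match(c("rid3", "rid1"), rn)
j <- match("cid2", cn)
sub <- h5read(f, "mat", index = list(i, j))
dimnames(sub) <- list(rn[i], cn[j])
sub
```

The trade-off is that the names are no longer attached to the data set itself, so anything that renames or copies "mat" has to keep the companion data sets in step.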
It's appropriate to ask questions about Bioconductor packages on the Bioconductor mailing list, where the package author is more likely to see them.