4

Let us say I want to store a matrix of large dimensions as an HDF5 file, and then I want read ONLY some slicing of the matrix by ID (by ID I mean the row names of the matrix, which in my case are unique so they can serve as ID). Is there a fast way to do this using HDF5 files?

Example.

my.mat = matrix(rnorm(400,2,1), nrow=100, ncol=4)
rownames(my.mat) = paste("id", c(1:100), sep="")

Then, I'd do something like this:

h5createFile("test.h5")
h5createDataset(file="test.h5", dataset="dat", dim=c(100,4), not sure what goes here)

And then I'd like to read the h5 file but ONLY a slice by row id, i.e., only read rows,

rows.to.read= c("id4", "id10", "id40")

The major objective is to make the slicing reading (by slicing reading I mean the reading by IDs) and the writing of entire HDF5 files as fast as possible since these are all very large datasets. I'd imagine one would have to store them in a way that they are hashed by row_name, such that the retrieval by ID can go fast without having to sort and/or scan the whole document.

Dnaiel
  • 7,622
  • 23
  • 67
  • 126

3 Answers3

2

I'm just learning to use the rhdf5 package. It seems like, for creating and indexing a matrix without dimnames, the operations are really straight-forward

library(rhdf5)
my.mat <- matrix(rnorm(400,2,1), nrow=100, ncol=4)
fl <- tempfile()
h5createFile(fl)
h5write(my.mat, fl, "mat")
h5read(fl, "mat", list(2:3, 3:4))
##           [,1]     [,2]
## [1,] 0.3199968 1.947390
## [2,] 1.3338179 2.623461
h5read(fl, "mat", list(2:3, NULL))
##          [,1]      [,2]      [,3]     [,4]
## [1,] 1.247648 -0.380762 0.3199968 1.947390
## [2,] 3.157954  1.334057 1.3338179 2.623461

It seems like the package supports some functionality, e.g., for writing data.frame objects, but I ended up 'rolling my own' function to create and subset / select for a matrix with dimnames. Here's the write function, which adds HDF5 attributes to the data set

h5matrix_write <-
    function(obj, file, name, ...)
{
    if (!is.matrix(obj) || is.null(dimnames(obj)) ||
        any(sapply(dimnames(obj), is.null)))
        stop("'obj' must be a matrix with row and column names")

    fid <- if (file.exists(file))
        H5Fopen(file)
    else
        H5Fcreate(file)
    h5createDataset(fid, name, dim=dim(obj))
    did <- H5Dopen(fid, name)

    h5createAttribute(fid, "rownames", nrow(obj), storage.mode="character",
                      size=max(nchar(rownames(obj))))
    h5createAttribute(fid, "colnames", ncol(obj), storage.mode="character",
                      size=max(nchar(colnames(obj))))

    h5writeDataset(obj, fid, name)
    h5writeAttribute(rownames(obj), did, "rownames")
    h5writeAttribute(colnames(obj), did, "colnames")

    H5Dclose(did)
    H5Fclose(fid)

    file
}

For reading in a subset, I check to see if the index is a character vector. If so, I identify the index into the matrix and use that to extract the relevant values

h5matrix_select <-
    function(file, name, i, j, ...)
{
    ## FIXME: doesn't handle logical subsetting
    fid <- H5Fopen(fl)
    did <- H5Dopen(fid, "mat")
    rownames <- H5Aread(H5Aopen(did, "rownames"))
    if (missing(i))
        i <- seq_along(rownames)
    else if (is.character(i)) {
        i <- match(i, rownames)
        if (any(is.na(i)))
            stop(sum(is.na(i)), " unknown row names")
    }
    rownames <- rownames[i]

    colnames <- H5Aread(H5Aopen(did, "colnames"))
    if (missing(j))
        j <- seq_along(colnames)
    else if (is.character(j)) {
        j <- match(j, colnames)
        if (any(is.na(j)))
            stop(sum(is.na(j)), " unknown colnames")
    }
    colnames <- colnames[j]

    value <- h5read(file, name, list(i, j))
    dimnames(value) <- list(rownames, colnames)
    value
}

In action:

dimnames(my.mat) <- list(paste0("rid", seq_len(nrow(my.mat))),
                         paste0("cid", seq_len(ncol(my.mat))))
fl <- h5matrix_write(my.mat, tempfile(), "mat")
h5matrix_select(fl, "mat", 4:5, 2:3)
##           cid2      cid3
## rid4 0.4716097 2.3490782
## rid5 2.0896238 0.5141749
h5matrix_select(fl, "mat", 4:5)
##           cid1      cid2      cid3     cid4
## rid4 2.0947833 0.4716097 2.3490782 3.139687
## rid5 0.8258651 2.0896238 0.5141749 2.509301
set.seed(123)
h5matrix_select(fl, "mat", sample(rownames(my.mat), 3), 2:3)
##            cid2     cid3
## rid29 0.6694079 3.795752
## rid79 2.1635644 2.892343
## rid41 3.7779177 1.685139

(h5read(fl, "mat", read.attributes=TRUE) reads everything in; I think the simpler approach from @jimmyb (storing the row names as a separate variable) would also work with rhdf5.

It's appropriate to ask questions about Bioconductor packages on the Bioconductor mailing list, where the package author might be more likely to see it.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
2

You can try the h5r package who's API is a little lighter-weight than the rhdf5 package, but less full-featured.

require(h5r)
my.mat = matrix(rnorm(400,2,1), nrow=100, ncol=4)
rnames = paste("id", c(1:100), sep="")

f = H5File("mymat.h5", 'w')
d = createH5Dataset(f, "mymat", my.mat)
r = createH5Dataset(f, "rnames", rnames)

Then slicing/indexing by integer/sequence is as expected (and not all read into memory)

d[1:2,1:2]
     [,1]       [,2]
[1,] 2.777984  1.6040137
[2,] 3.406609 -0.5168481

At the moment, you have to jump through a hoop with the rownames because the hdf5 data structures are all of one type (ignoring more complicated hdf5 types)

d[match(rows.to.read, r[]),]

disclaimer: I'm the author of the h5r package.

jimmyb
  • 4,227
  • 2
  • 23
  • 26
0

Check out http://www.bioconductor.org/packages/devel/bioc/manuals/rhdf5/man/rhdf5.pdf page 13 for how to read in subsets, specifically the index argument of h5read.

This should allow you to read in subsets.

The package isn't in CRAN, but the first answer here: How to deal with hdf5 files in R? should help you get where you need in terms of getting the package itself.

Community
  • 1
  • 1
James Tobin
  • 3,070
  • 19
  • 35
  • 1
    thanks. I find the documentation is quite crytptic. INDEX: List of indices for subsetting. The length of the list has to agree with the dimensional extension of the HDF5 array. Each list element is an integer vector of indices. A list element equal to NULL choses all indices in this dimension. Counting is R-style 1-based. How does this translate to indexing all the row by row name? – Dnaiel Jan 16 '14 at 22:06