
I would like to store a GenomicRanges::GRanges object from Bioconductor as a single column in a base R data.frame. The reason is that I'd like to write some ggplot2 functions that work exclusively with data.frames under the hood. However, none of my attempts have been fruitful. Basically, this is what I want to do:

library(GenomicRanges)

x <- GRanges(c("chr1:100-200", "chr1:200-300"))

df <- data.frame(x = x, y = 1:2)

But the column is automatically expanded into several columns, whereas I would like to keep it as a valid GRanges object in a single column:

> df
  x.seqnames x.start x.end x.width x.strand y
1       chr1     100   200     101        * 1
2       chr1     200   300     101        * 2

S4Vectors::DataFrame does exactly what I want, but I'd like a base R data.frame to do the same thing:

> S4Vectors::DataFrame(x = x, y = 1:2)
DataFrame with 2 rows and 2 columns
             x         y
     <GRanges> <integer>
1 chr1:100-200         1
2 chr1:200-300         2

I also tried the following without success:

> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
  y                                                           x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2                                                        <NA>

Warning message: In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, : corrupt data frame: columns will be truncated or padded with NAs

df[["x"]] <- I(x)

Error in rep(value, length.out = nrows) : attempt to replicate an object of type 'S4'

I had some minor success implementing an S3 variant of the GRanges class using vctrs::new_rcrd, but that seems a very roundabout way to get a single column representing a genomic range.
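For readers unfamiliar with that route, here is a minimal sketch of what such an S3 record class might look like. The constructor name `new_genomic_range` and the class name `genomic_range` are hypothetical and chosen only for illustration; this is not the implementation referred to above, and it ignores strand, seqinfo and metadata columns.

```r
library(vctrs)

# Hypothetical constructor: keeps the range components as fields of one record vector.
new_genomic_range <- function(seqnames = character(),
                              start = integer(),
                              end = integer()) {
  new_rcrd(
    list(seqnames = seqnames, start = start, end = end),
    class = "genomic_range"
  )
}

# Record vectors need a format method so they can be printed,
# including when they sit in a data.frame column.
format.genomic_range <- function(x, ...) {
  paste0(field(x, "seqnames"), ":", field(x, "start"), "-", field(x, "end"))
}

gr <- new_genomic_range(c("chr1", "chr1"), c(100L, 200L), c(200L, 300L))

df <- data.frame(y = 1:2)
df$x <- gr  # one column holding the whole record vector
df
```

The record is a plain S3 object, so base R treats it as an ordinary vector of length 2, but none of the GRanges methods (overlaps, seqinfo handling, and so on) carry over.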

teunbrand
  • Hi @teunbrand, I think most likely only S4Vectors::DataFrame allows that. Maybe another option is to convert back to GRanges on the fly? Like df = data.frame(y=1:2); df$x = data.frame(x) ; makeGRangesFromDataFrame(df$x) – StupidWolf Dec 17 '19 at 10:29
  • Yeah I was afraid so too. I could try to work around it by converting them back on the fly, but the tidy evaluation framework in ggplot doesn't really allow for this without me having to re-write some of the core ggproto classes (like Layer), which I'd rather not do. – teunbrand Dec 17 '19 at 10:47

3 Answers


I found a very simple way to convert a GRanges object to a data.frame, so that you can operate on the data.frame easily. The annoGR2DF function in the Repitools package does this.

> library(GenomicRanges)
> library(Repitools)
> 
> x <- GRanges(c("chr1:100-200", "chr1:200-300"))
> 
> df <- annoGR2DF(x)
> df
   chr start end width
1 chr1   100 200   101
2 chr1   200 300   101
> class(df)
[1] "data.frame"
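If installing Repitools is not an option, a similar flattening can be sketched with the `as.data.frame` method that GenomicRanges itself provides. Note that, like annoGR2DF, this expands the ranges into several columns rather than keeping them in one.

```r
library(GenomicRanges)

x <- GRanges(c("chr1:100-200", "chr1:200-300"))

# Flatten the ranges into one plain column per component.
df2 <- as.data.frame(x)
names(df2)
class(df2)
```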
Z. Zhang

A not-so-pretty but practical solution is to use the accessor functions of GenomicRanges and convert each result to a plain vector, i.e. numeric or character. I used magrittr here, but you can also do it without it.

library(GenomicRanges)
library(magrittr)

x <- GRanges(c("chr1:100-200", "chr1:200-300"))
df <- data.frame(y = 1:2)
df$chr <- seqnames(x) %>% as.character
df$start <- start(x) %>% as.numeric
df$end <- end(x) %>% as.numeric
df$strand <- strand(x) %>% as.character
df$width <- width(x) %>% as.numeric
df
  • Thank you for taking the time to write an answer, but this is the exact opposite of what I want to be able to do. I would like to keep the GRanges as a single column, in a base R data.frame. I've edited my question to state 'single column' more clearly, to prevent future confusion. – teunbrand Dec 17 '19 at 17:22
  • @teunbrand Oh, sorry I misunderstood the question. Will erase later, hopefully someone gets to the answer but seems likely that the S4Vectors::DataFrame is the only way around it. – Angel Garcia Campos Dec 17 '19 at 23:17

Since posting this question, I figured out that the crux of my problem is that the format method for S4 objects does not play nicely with data.frames; having a GRanges as a column isn't necessarily a problem in itself. (Constructing the data.frame directly with data.frame() still is, though.)

Consider this bit of the original question:

> df <- data.frame(y = 1:2)
> df[["x"]] <- x
> df
  y                                                           x
1 1 <S4 class ‘GRanges’ [package “GenomicRanges”] with 7 slots>
2 2                                                        <NA>

Warning message: In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, : corrupt data frame: columns will be truncated or padded with NAs

If we write a simple format method for GRanges, the data.frame prints without the warning:

library(GenomicRanges)

format.GRanges <- function(x, ...) {showAsCell(x)}

df <- data.frame(y = 1:3)

df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))
> df
  y            x
1 1 chr1:100-200
2 2 chr1:200-300
3 3 chr2:100-200

It seems to subset just fine too:

> df[c(1,3),]
  y            x
1 1 chr1:100-200
3 3 chr2:100-200
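To check that the column is still a real GRanges after assignment and row subsetting, rather than just something that prints like one, a quick sketch along these lines may help. The variable name `sub` is only illustrative.

```r
library(GenomicRanges)

# Same format method as above, so the data.frame can be printed.
format.GRanges <- function(x, ...) showAsCell(x)

df <- data.frame(y = 1:3)
df$x <- GRanges(c("chr1:100-200", "chr1:200-300", "chr2:100-200"))

sub <- df[c(1, 3), ]

# The column keeps its S4 class, so the usual accessors still apply.
is(sub$x, "GRanges")
width(sub$x)
as.character(seqnames(sub$x))
```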

As a bonus, this seems to work for other S4 classes too, for example:

library(S4Vectors)

format.Rle <- function(x, ...) {showAsCell(x)}

x <- Rle(1:5, 5:1)

df <- data.frame(y = 1:15)
df$x <- x
teunbrand