0

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.

My data.frame has the following structure:

CHR POS var_A var_B
1 192 0.9 0.7
1 2000  0.8 0.3
2 3 0.21  0.76 
2 30009 0.36  0.15
...

The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that each row is not entirely independent to one another, depending on the distance between them (the closer the more dependent), and so I cannot simply do cor(df$var_A, df$var_B).

The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.

I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.

x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
  df=freq[0,]
  for (i in unique(freq$CHR)) {
    t = freq[freq$CHR==i,]
    fim = t[nrow(t),2]
    i = t[1,2]
    f = i + x
    while(f < fim) {
      rows = which(t$POS>=i & t$POS<f)
      s = sample(rows)
      df = rbind(df,t[s,])
      i = f
      f = f + x
    }
  }
  cors = c(cors, cor(df$var_A, df$var_B))
}

Could anybody help me out? I am sure there is a more efficient way of doing this.

Thank you in advance.

uller
  • 115
  • 2
  • 10
  • How do the blocks need to be defined on `POS`? – Val Jun 24 '17 at 15:09
  • Because I need to sample one row every 1kb-blocks based on the position in the genome. There may be instances where I have more than one row inside these 1kb-blocks but instances where this does not happen. – uller Jun 24 '17 at 15:15
  • Your code is a bit confusing. Based on the small data example above, what would be a condition based on `POS` for a row to be selected? – Val Jun 24 '17 at 15:18
  • You'd have to "walk" along the positions and sample one row every 1kb-block. If your POS starts in 0, then you'd have to look for the rows that are inside the range [0,1000[ and sample one row. If there is no row inside this block then move on. If there is just one row keep that. – uller Jun 24 '17 at 15:39

3 Answers3

1

I hope I understood you right:

# needed for round_any()
library(plyr)

res <- lapply(unique(freq$CHR),function(x){
  
  freq_sel <- freq[freq$CHR==x,]
  blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
  do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
  
})

This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.


EDIT:

library(doParallel)
library(foreach)
library(plyr)

cl <-  makeCluster(detectCores())
registerDoParallel(cl)


res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
  
  freq_sel <- freq[freq$CHR==x,]
  blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
  do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
  
}

stopCluster(cl)

This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.

Community
  • 1
  • 1
Val
  • 6,585
  • 5
  • 22
  • 52
  • It works! Thank you. I'm just worried because it took a long time to run, and to do a bootstrap I'll have to run this could 1k times or more. But anyway, it was nice to see your code. I learned a lot from it. – uller Jun 24 '17 at 19:01
  • Do you think we can do something to speed up the code? – uller Jun 24 '17 at 20:35
  • If I run the code like this it'll take 52h to complete. – uller Jun 24 '17 at 20:51
  • Can you share some info about the size/structure of your data? And how many times do you need to iterate? – Val Jun 24 '17 at 20:52
  • The original data.frame has 72525 rows divided into 6 chromosomes. I need to run this code at least 1000 times to get a sense of the correlation between two variables that are in this df. To get a sense of the POS it ranges from 1 to 30 million. – uller Jun 24 '17 at 21:03
  • Thank you for your willingness to help! – uller Jun 25 '17 at 00:00
  • 1
    If you need to walk through all 1kb-blocks from 1 to 30 millions 1000 times, that's a lot of iteration and will take some time i guess. I'm not sure if my approach is the most efficient for that. Using simple parallelisation, I can do an iteration in ~1 min, which is still a lot of time if you need to do it 1000 times. I'm not familiar with the `boot` package, but maybe there's a more efficient way with using it. Have you seen [this question](https://stackoverflow.com/questions/11919808/block-bootstrap-from-subject-list)? Maybe that could be interesting for you – Val Jun 25 '17 at 12:17
  • could you help me implement the parallelization? – uller Jun 25 '17 at 15:39
  • Thank you!! I have already checked boot, and I am not sure it does what I need. – uller Jun 25 '17 at 17:29
1

One efficient way to try would be to use the 'boot' package, of which functions include parallel processing capabilities.

In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.

The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.

cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)

Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments, which indicate you want fixed blocks of length(l) 1000. However, you could do blocks of any size, 5 or 10 if your trying to capture nearest neighbor dynamics moving over time. The multicore argument speaks for itself, but it maybe "snow" if you are using windows.

library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)

In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.

Hope that helps, good luck.

Justin

Justin
  • 1,360
  • 12
  • 15
  • Thank you for your reply. I just have a few questions: (i) shouldn't the cor_hat function be as follows: `cor_hat <- function(data, index) cor(y = data$var_A, x = data$var_B)`?. (ii) I don't have to tell tsboot that the column POS should be used for the blocks? (iii) what about the different chromosomes? If I take into account only the POS then things that are in different chromosomes will be mixed together. – uller Jun 24 '17 at 18:49
  • Also, can I pass my data.frame to the tsboot function or do I have to change it to a time-series object? – uller Jun 24 '17 at 19:11
  • It sounds like this is not a time series and the observations aren't completely ordered, and I'm not familiar enough with the domain to know how POS and chromosomes are related. For question (i), you need the index which is the point of the wrapper function around the original. Sorry, my syntax was from another model, which I corrected above and here `cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)` (ii) If its a time series and indexed along POS, tsboot will generate it automatically. But that doesn't appear to be the case. – Justin Jun 24 '17 at 19:20
  • I think there are similarities between time series and genomic data. The POS is the position of an observation in the genome. The genome is divided in chromosomes that are independent to one another. Maybe we should treat each chromosome as a different time series? How do I transform my data into a time series indexing it by POS? – uller Jun 24 '17 at 19:29
  • They are ordered but also there is a structure to it in the sense that inside each chromosome the POS reflects the order of an observation. – uller Jun 24 '17 at 19:31
  • Ok got it. `tsboot` should accept the dataframe as is and will index what ever you give it. Just subset it by the first genome and try it out. Then, you can iterate over all six using the same framework provided by @Val. – Justin Jun 24 '17 at 21:59
  • Just one more thing I thought about. There isn't necessarily one row per position in POS, but POS is ordered (e.g., POS = c(1,2,3,10,15,16,2000,2001,2004). Is that a problem? – uller Jun 24 '17 at 23:58
  • No prob! `tsboot` will still take ordered blocks of the data.frame. – Justin Jun 25 '17 at 00:21
0

So, after a while I came up with an answer to my problem. Here it goes.

You'll need the package dplyr.

l = 1000
teste = freq %>%
  mutate(w = ceiling(POS/l)) %>%
  group_by(CHR, w) %>%
  sample_n(1)

This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.

You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.

uller
  • 115
  • 2
  • 10