-3

I have a big data.frame with genomic data. The data looks like this- colnames(df)=c("id","chr","start","end","log2") where id is the sample name, chr is the number of the chromosome, start and end give me the location on the chromosome, and log2 is how high/low was the read in that position.

Because there is a lot of data, and it's hard to understand what's going on, I'm trying to go over each sample (id), and for each chromosome (chr) I want to calculate the median of log2 in segments, let say all the reads that are between 1 to 10^7, 1+10^7 to 2*10^7 and so on.

the result should be a new data.frame, for each sample and each chromosome I should have several rows, with start and end indicating what segment I am in, and the last value will be the median of that segment.

I think I need to use tapply() and go over samples, and in it tapply() and go over the chromosomes, and then in each chromosome, a loop over "start" position? (let say I care only if the start coordinate is in the range) Not sure exactly how to approach this.

Any hints, tips, directions will be very appreciated.

Reproducible example-

# fabricated data, 4 samples
# 24 chromosomes in each sample
# 61 ranges in each chromosome

df <- data.frame(id = rep(c('F1','F2','M1','M2'), each = 24*61), 
                 chr = rep(rep(c(1:22,'x','y'), each = 61),4),
                 start = rep(seq(1,25*10^6 - 99, length.out = 61),times = 24*4),
                 end = rep(seq(100,25*10^6, length.out = 61),times = 24*4),
                 log2 = rnorm(4*24*61))

# output should look something like this-
id      chr     start    end       median_log_2
"F1"    "1"     1        8000000   0.002
"F1"    "1"     8000001  16000000  0.00089
"F1"    "1"     16000001 24000000  -0.0011
"F1"    "1"     24000000 25000000  0.108
"F1"    "2"     1        8000000   -0.0012
"F1"    "2"     8000001  16000000  0.0089
"F1"    "2"     16000001 24000000  0.00311
"F1"    "2"     24000000 25000000  0.0128
...
...
T.G.
  • 743
  • 3
  • 6
  • 27
  • Instead of describing with words, just post a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and expected output. – Sotos Jan 02 '17 at 08:58
  • I'll add a reproducible example (like you like to ask), but I really don't think this will make anything more clear. – T.G. Jan 02 '17 at 09:37
  • What do you mean "like I like to ask"??? You think it is a personal choice? or that I am trying to 'bully' you? I am trying to help you (well was) and reproducible example and expected output makes it easier for me (and others) to help you. You have asked 16 questions here in SO. I should NOT have to tell you things like these and you should NOT make comments like that. – Sotos Jan 02 '17 at 09:38
  • not you specifically, but I go around here a little bit, and I might be mistaken but in my eyes there is definitely a tendency to ask for a reproducible example in a lot of questions, even if the reproducible example doesn't make anything more clear. – T.G. Jan 02 '17 at 09:44
  • I'm really sorry if I offended you, I really didn't mean it to sound like that.. – T.G. Jan 02 '17 at 09:45
  • You said "like YOU like to ask". And why do you think your question did not receive any answers? or any attention? Because it is very clear?? NO. Because reproducible example is missing! People won't even read questions without reproducible examples!!! Anyway. I am done with this. Good luck – Sotos Jan 02 '17 at 09:46
  • We can try with `dplyr` i..e. `df %>% group_by(id, chr, start = cut(start, breaks = c(1, 8000001, 16000001, 24000000)), end = cut(end, breaks = c(8000000, 16000000, 24000000, 25000000))) %>% summarise(median_log2 = median(log2)) %>% na.omit()` – akrun Jan 02 '17 at 10:10
  • Thank you. I'll look into this. – T.G. Jan 02 '17 at 11:00

1 Answers1

0
median_data <- tapply(df$log2, 
                      list(df$id, 
                           df$chr, 
                           cut(df$start, c(0,8*10^6,1.6*10^7,2.4*10^7,3.2*10^7,4*10^8))),
                      median)
median_data <- as.data.frame.table(median_data)

Did the job. (The output is not in the right format, but it's close enogh for me)

In tapply(), you can subset by more than one argument, using list().

T.G.
  • 743
  • 3
  • 6
  • 27