5

The attached plot (Manhattan plot) contains on the x axis chromosome positions from the genome and on the Y axis -log(p), where p is a p-value associated with the points (variants) from that specific position. enter image description here

I have used the following R code to generate it (from the gap package) :

require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273, 
     24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green",          "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
  chr pos          p
1   1   1 0.79296584
2   1   2 0.96675136
3   1   3 0.43870076
4   1   4 0.79825513
5   1   5 0.87554143
6   1   6 0.01207523

I am interested in getting the coordinates of the peaks of the plot above a certain threshold (-log(p)) .

agatha
  • 1,513
  • 5
  • 16
  • 28
  • What do you mean by "zero-crossings"? – Paul Hiemstra Feb 25 '13 at 13:57
  • "first derivative of a peak has a downward-going zero-crossing at the peak maximum"- from here: http://terpconnect.umd.edu/~toh/spectrum/PeakFindingandMeasurement.htm#findpeaks ; I am not sure if I have understood correctly – agatha Feb 25 '13 at 14:03
  • @agatha, is this different from the question you asked on [**biostars**](http://www.biostars.org/p/64416/#64558) for which I provided an answer..? – Arun Feb 25 '13 at 14:34
  • @Arun - Nope it isn't...I have planned to use the approach you have suggested for significant/non-significant differentiation of SNPs after selecting the tops and then perhaps use a Thesias test to check if they come from one association signal..I have never worked with SNP/chip data and i am not sure which is the specific way to do it to make sure that I am eliminating the noise and focus only on the peaks. – agatha Feb 25 '13 at 14:37
  • @agatha, those peak findings by first derivative, iiuc, are applied to smooth/continuous functions (where derivative exists). I'm not quite sure if the same applies. Could you explain a bit more (by editing your post) what your end goal is..? – Arun Feb 25 '13 at 14:41
  • @ Arun - that is the problem- by smoothing -log(p) and generating the first degree derivative of that will generate 0 ..as they are all constants and there is no function to derivate. What I want is to select the peaks that are coming from the same signal and then perform a separate analysis on those SNPs..so I just need the peak coordinates from the plot and then perform the analysis that you have suggested. – agatha Feb 25 '13 at 14:43
  • what's the problem with choosing an arbitrary threshold? Or for that matter, a quantile based approach as shown by paul. – Arun Feb 25 '13 at 14:56
  • An arbitrary threshold gives me 4500 points.. I guess I should just continue with the analysis from here.. – agatha Feb 25 '13 at 14:59

2 Answers2

1

If you want the indices of the values above the 99th percentile:

# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)

...and get the values from the original data test:

peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
    chr pos           p    log_p
5     1   5 0.002798126 2.553133
135   1 135 0.003077302 2.511830
211   1 211 0.003174833 2.498279
586   1 586 0.005766859 2.239061
598   1 598 0.008864987 2.052322
790   1 790 0.001284629 2.891222

You can use this with any threshold. Note that I have not calculated the first derivative, see this question for some pointers:

How to calculate first derivative of time series

after calculating the first derivative, you can find the peaks by looking at points in the timeseries where the first derivative is (almost) zero. After identifying these peaks, you can check which ones are above the threshold.

Community
  • 1
  • 1
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • Ok, this is working...I have tried it on my data as well ...I couldn't see the forest behind the trees. Thanks. – agatha Feb 25 '13 at 14:13
  • However, I still cannot distinguish which one of them are peaks or not ...so from this result I would need an extra filter..the p-value is not enough as i do not know which are the points that I am looking for.. – agatha Feb 25 '13 at 14:15
  • 1
    I regularly recommend the function `pastecs::turnpoints` for finding (reasonably smooth) local maxima and minima. – Carl Witthoft Feb 25 '13 at 15:38
0

Based on my experience after plotting the graph you can use following R code to find the peak coordinate

plot(x[,1], x[,2])

identify(x[,1], x[,2], labels=row.names(x))

note here x[,1] refers to x coordinate(genome coordinate and x[,2] would be #your -log10P value

at this time use point you mouse to select a point and hit enter which #will give you peak location and then type the following code to get the #coordinate

coords <- locator(type="l")

coords

Community
  • 1
  • 1
akhan
  • 17
  • 2