Data selection error

Question

There is an error when i try to select data from large dataset(d1) based on small dataset(d2). Below is my script and the problem.

**d1 <- read.table("MSv25.txt",header=T)

d2 <- read.table("Flairall.Gene.txt",header=T)

d2$low <- d2$start-10000 ; d2$high <- d2$end+10000

d1$matched <- apply(d1,1,function(p) which(p['POS'] >=d2[,'low'] & p['POS'] <= d2[,'high'] & p['CHR']==d2[,"chromosome"]))

d3 <- cbind(d1[which(d1$matched >0),], d2[unlist(d1$matched[which(d1$matched>0)]),])

write.table(d3,file="Flairall.GOBSgene.txt",quote=FALSE,sep="\t",row.names=FALSE,col.names=TRUE)**

The d1 is like this:

SNP CHR POS A1  A2  OR  P
rs1000007   2   237416793   C   T   0.9785  0.4868
rs1000003   3   99825597    G   A   0.9091  0.009774
rs1000002   3   185118462   C   T   1.0111  0.6765
rs10000012  4   1347325 G   C   1.0045  0.9087
rs10000042  4   5288053 T   C   1.0622  0.3921
rs10000062  4   5305645 G   C   1.0116  0.779
rs10000132  4   7450570 T   C   0.9734  0.4748
rs10000081  4   16957461    C   T   1.0288  0.3585
rs10000100  4   19119591    A   G   1.0839  0.1417
rs10000010  4   21227772    C   T   0.971   0.2693
rs10000092  4   21504615    C   T   1.0342  0.27
rs10000068  4   36600682    T   C   1.055   0.103
rs10000013  4   36901464    C   A   1.0198  0.5379
rs10000037  4   38600725    A   G   1.0249  0.4217
rs10000017  4   84997149    T   C   0.9576  0.1912
rs10000109  4   91586292    A   T   0.9956  0.9349
rs10000023  4   95952929    T   G   0.9998  0.9951
rs10000030  4   103593179   A   G   1.0839  0.04208
rs10000111  4   107137517   A   G   1.0812  0.3128
rs10000124  4   109325900   A   C   1.0642  0.1906
rs10000064  4   128029071   C   T   1.0388  0.1578
rs10000029  4   138905074   C   T   0.7832  0.14
rs10000036  4   139438712   T   C   0.9848  0.5683
rs10000033  4   139819348   C   T   0.9918  0.7542
rs10000121  4   157793485   A   G   1.0008  0.9769
rs10000041  4   165841405   G   T   1.0042  0.9146
rs10000082  4   167529642   T   C   0.9733  0.6612

The d2 is like this:

gene    start   end chromosome
WFDC9   237416000   237418000   2
SRGAP3  19119590    21504615    4

In general, i want to select the SNPs within the genes by extending a window of 10kb at the start and end position .

Here is my script results:

   SNP CHR       POS A1 A2     OR      P matched  gene     start       end chromosome       low      high
    1 rs1000007   2 237416793  C  T 0.9785 0.4868       1 WFDC9 237416000 237418000          2 237406000 237428000

Which is not correct, as the there is one gene missing. The correct one should be:

gene    start   end chromosome  SNP CHR POS A1  A2  OR  P
WFDC9   237416000   237418000   2   rs1000007   2   237416793   C   T   0.9785  0.4868
SRGAP3  19119590    21504615    4   rs10000100  4   19119591    A   G   1.0839  0.1417
SRGAP3  19119590    21504615    4   rs10000010  4   21227772    C   T   0.971   0.2693
SRGAP3  19119590    21504615    4   rs10000092  4   21504615    C   T   1.0342  0.27

Could anyone help me to point out what is wrong.... Many thanks...

This is the third time you've asked this question: http://stackoverflow.com/questions/18573957/in-large-dataset-deleted-selected-dataset-based-on-another-small-daseset http://stackoverflow.com/questions/18157323/using-r-to-select-data-based-on-another-dataset Why were these answers insufficient, and what have you tried since? — Scott Ritchie, Sep 16 '13 at 07:32

Scott Ritchie · Accepted Answer · 2013-09-17T06:36:59.837

2

Your code seems to be completely backwards to what you're trying to achieve:

"For each gene (in d2) which SNPs (from d1) are within 10kb of that gene?"

First of all, your code for d1$matched is backwards. All your p's and d2s should be the other way round (currently it doesn't make much sense?), giving you a list of SNPs whom are in cis with each gene (+/- 10kb).

I would approach it the way i've phrased your question:

cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in 1:nrow(d2)) {
  # Broken down into steps for readability.
  inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
  inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
  inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
  # Now we have the cis-SNPs, so lets build the data.frame for this gene,
  # and grow our data.frame d3:
  if (nrow(inCis) > 0) {
    d3 <- rbind(d3, cbind(d2[i,], inCis))
  }
}

I tried to find a solution which didn't involve growing d3 in the loop, but because you're attaching each row of d2 to 0 or more rows from d1 I wasn't able to come up with a solution that's not horribly inefficient.

edited Sep 17 '13 at 06:36

answered Sep 16 '13 at 08:23

Scott Ritchie

10,293
3
28
64

I've fixed the code so that it won't crash for genes who have 0 cis-SNPs. – Scott Ritchie Sep 17 '13 at 06:38
2

@user2669497, this may do the trick, but depending on the number of SNPs, genes and your genome, this may be (very) inefficient. The algorithm you're looking for is [Interval tree](http://en.wikipedia.org/wiki/Interval_tree). The package you should be using is [GenomicRanges](http://bioconductor.org/packages/2.12/bioc/html/GenomicRanges.html). – Arun Sep 18 '13 at 08:15
Huh, I didn't know about that package! I spawned a new discussion due to my frustration formulating an answer here: http://stackoverflow.com/questions/18840410/efficiently-merging-two-data-frames-on-a-non-trivial-criteria – Scott Ritchie Sep 18 '13 at 15:08
An `sqldf` query was able to give me all cis-SNPs (+/- 10Kb range) for 35k genes and 2 million markers in just under 30 minutes. – Scott Ritchie Sep 18 '13 at 15:09

Data selection error

1 Answers1

Linked