1

I would like to merge/combine two files, so that if an entry in column B of my first file falls into the range of columns B and C in my second file, the output will contain all of the information contained in both files.

Like this:

file 1

     A      B
    rs10    23353
    rs100   10000
    rs234   54440

file 2

    A        B      C
    E235    20000   30000
    E255    50000   60000

where, because 23353 falls between 20000 and 30000, and 54440 falls between 50000 and 60000 the output file would look like this:

rs      23353   E235    20000   30000
rs234   54440   E255    50000   60000

and rs100 would be removed (from the output file) because it does not fall within a range of any row in file 2.

any suggestions?

Thank you!

mfk534
  • 719
  • 1
  • 9
  • 21

3 Answers3

1

Something like this should do the trick. You could probably make it more concise, but to elucidate all the steps I made it overly obvious.

NewMatrixCol1 <- c()
NewMatrixCol2 <- c()
NewMatrixCol3 <- c()
NewMatrixCol4 <- c()
NewMatrixCol5 <- c()

for (i in 1:length(file1$A)) {
    for (j in 1:length(file2$A)) {
        LowNumber <- file2$B[j]
        HighNumber <- file2$C[j]
        if (LowNumber <= file1$B[i] & file1$B[i]  <= HighNumber) {
            append(NewMatrixCol1, file1$A[i])
            append(NewMatrixCol2, file1$B[i])
            append(NewMatrixCol3, file2$A[j])
            append(NewMatrixCol4, file2$B[j])
            append(NewMatrixCol5, file2$C[j])
        } else {}
    }
}

dataframe <- data.frame(Col1 = NewMatrixCol1, Col2 = NewMatrixCol2, Col3 = NewMatrixCol3, Col4 = NewMatrixCol4, Col5 = NewMatrixCol5)

EDIT1: I misunderstood the question, and am now working on it.

EDIT2: This new solution should work as indicated.

EDIT3: There was a missing ), as indicated by mfk534.

MikeZ
  • 345
  • 1
  • 4
  • 15
  • I appreciate the overly obvious aspect ... conciseness can come later. Thanks, Mike! – mfk534 Aug 07 '12 at 19:12
  • 1
    Just also, FYI for anyone else who comes across this and might repurpose - the first line in Mike's code needs a closing ) .. as in, for (i in length(file1$A)) { – mfk534 Aug 07 '12 at 19:21
  • I've added the closing `)` that was missing. You are quite welcome, hope this works well for you! – MikeZ Aug 07 '12 at 19:26
  • @MikeZ, can you double check the answer you've posted here? As it is, it's not working for me. One mistake I see is in `Col4` and `Col5` in the last line. – A5C1D2H2I1M1N2O1R2T1 Aug 07 '12 at 19:58
  • It ran with my data but produced an empty file - I fiddled with a few things and am now getting an error: Error in if (LowNumber <= file1$B[i] & file1$B[i] <= HighNumber) { : argument is of length zero – mfk534 Aug 08 '12 at 14:00
  • Try converting the initial matrix to a data frame by using `as.data.frame()`. There might be an indexing error in this case, as the convention I was using was for a data frame. – MikeZ Aug 08 '12 at 18:13
  • error message stays the same: `Error in if (LowNumber <= file1$B[i] & file1$B[i] <= HighNumber) { : argument is of length zero ` – mfk534 Aug 08 '12 at 19:02
  • Okay, I added `1:` to the `for` loops. This should solve the problem. Initially it was `in length(...)`, now it is `in 1:length(...)`. – MikeZ Aug 09 '12 at 22:11
1

I see you've already accepted an answer, but here is another possible solution.

This function was just hacked together, but could be worked on some more to be made more generalized.

myfun = function(DATA1, DATA2, MATCH1, MIN, MAX) {
  temp = sapply(1:nrow(DATA2), 
                function(x) DATA1[[MATCH1]] >= DATA2[[MIN]][x] &
                  DATA1[[MATCH1]] <= DATA2[[MAX]][x])
  if (isTRUE(any(rowSums(temp) == 0))) {
    temp1 = DATA1[-(which(rowSums(temp) == 0)), ]
  }
  OUT = cbind(temp1[order(temp1[[MATCH1]]), ], 
              DATA2[order(DATA2[[MIN]]), ], row.names=NULL)
  condition = ((OUT[4] <= OUT[2] & OUT[2] <= OUT[5]) == 0)
  if (isTRUE(any(condition))) {
    OUT[-which(condition), ]
  } else {
    OUT
  }
}

Here's what the function does:

  1. It first compares, row by row, the value in the second column of the first data.frame with the values in the second and third columns of the second data.frame.
  2. It then checks to find if any of those has FALSE for both conditions, and removes them from the first data.frame.
  3. Then, it sorts the first data.frame by the second column, and the second data.frame by the "min" match column.
  4. Finally, it does one more check to ensure that the value from the first dataset is between the provided values; if not, that row is removed.

Now, here is some sample data. A and B are the same as your provided data. X and Y have been changed for further testing purposes. In the merge between X and Y, there should be only one row.

A = read.table(header=TRUE, text="A      B
    rs10    23353
    rs100   10000
    rs234   54440")

B = read.table(header=TRUE, text="A        B      C
    E235    20000   30000
    E255    50000   60000")

X = A[c(3, 1, 2), ]
X[1, 2] = 57000
Y = B
Y[2, 3] = 55000

Here's how you would use the function and the output you would get.

myfun(A, B, 2, 2, 3)
#       A     B    A     B     C
# 1  rs10 23353 E235 20000 30000
# 2 rs234 54440 E255 50000 60000
myfun(X, Y, 2, 2, 3)
#      A     B    A     B     C
# 1 rs10 23353 E235 20000 30000
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Hi - so with your solution I'm having an issue because the two files I'm using are of different lengths: `Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 438, 21976` Do you have any suggestions? – mfk534 Aug 09 '12 at 21:03
1

UPDATE: This question was more complicated than indicated here. The solution can be found here: Merge by Range in R - Applying Loops, and is delivered by using the GenomicRangespackage in Bioconductor. Very useful package!

Community
  • 1
  • 1
mfk534
  • 719
  • 1
  • 9
  • 21