
I have two datasets that share a common ID variable and n variables denoted SNP1 to SNPn. An example of the two datasets is shown below.

Dataset 1

ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1   0    1    1    0    0    0    0
2   1    1    0    0    0    0    0
3   1    0    0    0    1    1    0
4   0    1    1    0    0    0    0
5   1    0    0    0    1    1    0
6   1    0    0    0    1    1    0
7   0    1    1    0    0    0    0

Dataset 2

ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1  0.65 1.3  2.8  0.43 0.62 0.9  1.5
2  0.74 1.6  3.4  0.9  2.4  4.4  2.3
3  0.28 0.5  5.7  6.7  0.3  2.5  0.56
4  0.74 1.6  3.4  0.9  2.4  4.4  2.3
5  0.65 1.3  2.8  0.43 0.62 0.9  1.5
6  0.74 1.6  3.4  0.9  2.4  4.4  2.3
7  0.28 0.5  5.7  6.7  0.3  2.5  0.56

I would like to multiply each value in a given position in dataset 1 by the value in the equivalent position in dataset 2.

For example, I would like to multiply position [1,2] in dataset 1 (value = 0) by position [1,2] in dataset 2 (value = 0.65). My dataset is very large and spans almost 300 columns and 500,000 IDs.

The variable names for SNP1-n are longer in reality (for example, they actually read Affx.5869593), so I cannot just use SNP1-300 in my code. The columns would have to be specified by position or by the number of columns.

Do I need to unlist both datasets by person ID and SNP name first? What function can be used for multiplying values within two datasets?

2 Answers

0

I am assuming that you are trying to return a third dataframe which will have, in each position, the product of the values that were in that position in the two data frames.

For example, if the following are your two dataframes

df1 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1a = c(0, 1, 1, 0, 1
), SNP2a = c(1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, 
-5L))

ID  SNP1a  SNP2a
1     0     1
2     1     1
3     1     0
4     0     1
5     1     0

df2 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1b = c(0.65, 0.74, 
0.28, 0.74, 0.65), SNP2b = c(1.3, 1.6, 0.5, 1.6, 1.3)), class = 
"data.frame", row.names = c(NA, -5L))

ID SNP1b SNP2b
1  0.65   1.3
2  0.74   1.6
3  0.28   0.5
4  0.74   1.6
5  0.65   1.3

Then

df3 <- df1[,2:3] * df2[,2:3]

  SNP1a SNP2a
1  0.00   1.3
2  0.74   1.6
3  0.28   0.0
4  0.00   1.6
5  0.65   0.0

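will work, as long as the two data frames have the same dimensions and their rows are in the same order. The result takes its column names from the first data frame.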
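The same multiplication can be written without a fixed column range or any typed-out SNP names. The following is only a minimal sketch: it assumes the ID column is called `ID` in both data frames, that the SNP columns carry the same names in both (as in the question), and that the rows are already in the same order. The names `snp_cols` and `df3` and the toy data are illustrative.

```r
# Toy data standing in for the real data frames (same shape, same row order)
df1 <- data.frame(ID = 1:5,
                  SNP1 = c(0, 1, 1, 0, 1),
                  SNP2 = c(1, 1, 0, 1, 0))
df2 <- data.frame(ID = 1:5,
                  SNP1 = c(0.65, 0.74, 0.28, 0.74, 0.65),
                  SNP2 = c(1.3, 1.6, 0.5, 1.6, 1.3))

# Every column except the ID column, without naming any SNP
snp_cols <- setdiff(names(df1), "ID")

# Guard against silently multiplying misaligned data
stopifnot(identical(dim(df1), dim(df2)), all(df1$ID == df2$ID))

# Element-wise product, with the ID column put back in front
df3 <- cbind(ID = df1$ID, df1[snp_cols] * df2[snp_cols])
```

Indexing with `df1[snp_cols]` rather than `df1[, snp_cols]` keeps the result a data frame even if only one SNP column is selected.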

  • In the first portion, would I need to type out each ID and SNP value? I have over 800,000 data values so I'm looking for something where I wouldn't need to specify SNPs and IDs by name – Talia Delamare Jul 03 '18 at 13:03
  • @TaliaDelamare, no, as discussed in more detail by the other answer. You simply need to provide the range of 'SNP' columns that are equal for both dataframes. For example, if dataframe-1 has 300 columns (ID column + 299 'SNP' columns), and dataframe-2 also has 300 columns (ID column + 299 'SNP' columns) then the code above `df3 <- df1[ , 2:300] * df2[ , 2:300]` will work. Remember that when 'subsetting' dataframes, it uses the notation `[row,columns]` and 2:300 will give you each column from the 2nd to the 300th. Note that I have assumed ID variable is column 1 in both dataframes. –  Jul 03 '18 at 13:23
  • @TaliaDelamare, if either of the two answers in this post has solved your problem, please mark that answer as accepted with the tick. Thanks. –  Jul 03 '18 at 14:08
  • @TaliaDelamare You may also find it useful to use the [question wizard](https://stackoverflow.com/questions/ask/wizard) to structure your questions to make sure that they are not too open to interpretation, and that they are reproducible. You may also want to look into [dput()](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Thanks –  Jul 03 '18 at 14:14
0

If your data frames have an identical set of IDs and are the same size, you could sort both by ID and do this:

df <- data.frame(
  id = c(1,2,3,4,5),
  snp1 = c(0,0,1,0,0),
  snp2 = c(1,1,1,0,1)
)

df2 <- data.frame(
  # use `=` (not `<-`) so the columns are named id, snp1 and snp2
  id = c(1,2,3,4,5),
  snp1 = c(0.3,0.2,0.3,0.1,0.2),
  snp2 = c(0.5,0.8,0.2,0.3,0.3)
)


# mapply returns a matrix here, so convert it before adding the id column
res <- as.data.frame(mapply(`*`, df[,-1], df2[,-1]))
res$id <- df$id
  • Would I need to manually type out each value in order for this to work? I can't type it out manually as I have over 800,000 data points. – Talia Delamare Jul 03 '18 at 12:46
  • Of course you don't have to. I have assumed that you already have the data frames read into R and just don't know how to proceed. You didn't provide a reproducible example, so I made one myself to show how it works. You just have to sort both data frames by id (use dplyr::arrange) and order the columns so that they match. mapply will take the first column from the first data frame and multiply it by the first column from the second, and so on for the remaining columns. `*` is a vectorized function, so if the ids are also in the same order in both data frames, it will give the correct result. – Paweł Kozielski-Romaneczko Jul 03 '18 at 12:59
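
The alignment step described in the last comment can also be sketched in base R. This version uses `match()` instead of the `dplyr::arrange` sort suggested there, and it assumes that the id column is called `id`, that ids are unique, and that both data frames use the same SNP column names. The data frames `df_a` and `df_b` and the helper names `snp_cols` and `idx` are made up for illustration.

```r
# Toy data: rows and columns deliberately in different orders
df_a <- data.frame(id = c(3, 1, 2), snp1 = c(1, 0, 0), snp2 = c(1, 1, 1))
df_b <- data.frame(id = c(1, 2, 3),
                   snp2 = c(0.5, 0.8, 0.2),
                   snp1 = c(0.3, 0.2, 0.3))

# Columns to multiply: everything except the id, in df_a's order
snp_cols <- setdiff(names(df_a), "id")

# Row positions in df_b that correspond to each id in df_a
idx <- match(df_a$id, df_b$id)
stopifnot(!anyNA(idx))  # every id in df_a must exist in df_b

# Reorder df_b's rows by id and its columns by name, then multiply
res <- df_a[snp_cols] * df_b[idx, snp_cols, drop = FALSE]
res$id <- df_a$id
```

Because the rows are matched on id and the columns on name, neither data frame has to be sorted beforehand.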