manipulating database in R

Question

I'm quite new to R and I want to do something with my data. Can anybody help me implement this in R?

I have a data matrix (mydata1) as shown below, and I want to add a second column to it from a second database.

My first data matrix looks like:

> mydata1[1:4,1:3]

 Gene ID              lung.cancer lung.cancer.1 lung.cancer.2
hsa-miR-616*      3.653241       1.00000      1.838179
hsa-miR-1296     2.688751      36.12798     43.823880
hsa-miR-338-5p   29.893947      2.21830     48.048856
hsa-miR-452*    5.693279    1015.58508   35.165157
>

And my second database looks like:

> Database 

      ENS ID           Gene ID

    ENSG00000221263   hsa-mir-548p
    ENSG00000207941   hsa-miR-616
    ENSG00000207800   hsa-mir-504
    ENSG00000222831   hsa-mir-1537
    ENSG00000207582   hsa-mir-30b
    ENSG00000199153   hsa-miR-338-5p
    ENSG00000215998   hsa-mir-935
    ENSG00000207804   hsa-mir-599

I want to add a new column called ENS ID after Gene ID in my first data matrix (mydata1). It should take each Gene ID from mydata1, search for it in Database, and, if it is found, add its corresponding ENS ID to mydata1 in the new column.

The expected output would look like :

  Gene ID            ENS ID            lung.cancer lung.cancer.1 lung.cancer.2
    hsa-miR-616*     ENSG00000207941   3.653241       1.00000      1.838179
    hsa-miR-1296                       2.688751      36.12798     43.823880
    hsa-miR-338-5p   ENSG00000199153   29.893947      2.21830     48.048856
    hsa-miR-452*                       5.693279    1015.58508     35.165157

The answer is the `merge` function. There must be hundreds of similar questions on SO. — IRTFM, Nov 05 '13 at 22:43
@user2806363 "duplicate" has a much more liberal definition here than in the real world. — Señor O, Nov 05 '13 at 22:46
@Dwin `merge` as in the accepted answer's first solution in the question I posted. — Thomas, Nov 05 '13 at 22:55
Yes. Progressive application of cluestick apparently needed. — IRTFM, Nov 05 '13 at 23:19
I suggest modification to the question's title. Manipulating a database in R seems to cover what I spend a third of my time doing. — Mark Miller, Nov 05 '13 at 23:40

score 2 · Answer 1 · answered Nov 05 '13 at 23:17

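I'm doing this out of frustration with @user2806363's inability to read for meaning. The first line strips the trailing asterisks so that the Gene_ID values can match those in Database.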

> mydata1[,1] <- sub("\\*","",mydata1[,1])
> dput(mydata1)
structure(list(Gene_ID = c("hsa-miR-616", "hsa-miR-1296", "hsa-miR-338-5p", 
"hsa-miR-452"), lung.cancer = c(3.653241, 2.688751, 29.893947, 
5.693279), lung.cancer.1 = c(1, 36.12798, 2.2183, 1015.58508), 
    lung.cancer.2 = c(1.838179, 43.82388, 48.048856, 35.165157
    )), .Names = c("Gene_ID", "lung.cancer", "lung.cancer.1", 
"lung.cancer.2"), row.names = c(NA, -4L), class = "data.frame")
> dput(Database)
structure(list(ENS_ID = structure(c(7L, 5L, 3L, 8L, 2L, 1L, 6L, 
4L), .Label = c("ENSG00000199153", "ENSG00000207582", "ENSG00000207800", 
"ENSG00000207804", "ENSG00000207941", "ENSG00000215998", "ENSG00000221263", 
"ENSG00000222831"), class = "factor"), Gene_ID = structure(c(5L, 
7L, 4L, 1L, 2L, 3L, 8L, 6L), .Label = c("hsa-mir-1537", "hsa-mir-30b", 
"hsa-miR-338-5p", "hsa-mir-504", "hsa-mir-548p", "hsa-mir-599", 
"hsa-miR-616", "hsa-mir-935"), class = "factor")), .Names = c("ENS_ID", 
"Gene_ID"), class = "data.frame", row.names = c(NA, -8L))

> merge(mydata1, Database)
         Gene_ID lung.cancer lung.cancer.1 lung.cancer.2          ENS_ID
1 hsa-miR-338-5p   29.893947        2.2183     48.048856 ENSG00000199153
2    hsa-miR-616    3.653241        1.0000      1.838179 ENSG00000207941

> merge(mydata1, Database, all.x=TRUE)
         Gene_ID lung.cancer lung.cancer.1 lung.cancer.2          ENS_ID
1   hsa-miR-1296    2.688751      36.12798     43.823880            <NA>
2 hsa-miR-338-5p   29.893947       2.21830     48.048856 ENSG00000199153
3    hsa-miR-452    5.693279    1015.58508     35.165157            <NA>
4    hsa-miR-616    3.653241       1.00000      1.838179 ENSG00000207941
>
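
Note that `merge` sorts the rows by the key and puts ENS_ID last, while the question asks for ENS_ID right after Gene_ID with the original rows untouched. A minimal sketch of one way to get that layout with `match`, assuming both objects are data frames with character columns named as in the `dput` output above (the names `key`, `idx` and `result` are only illustrative):

```r
# Rebuild the example data from the question (asterisks kept in mydata1)
mydata1 <- data.frame(
  Gene_ID = c("hsa-miR-616*", "hsa-miR-1296", "hsa-miR-338-5p", "hsa-miR-452*"),
  lung.cancer = c(3.653241, 2.688751, 29.893947, 5.693279),
  lung.cancer.1 = c(1, 36.12798, 2.2183, 1015.58508),
  lung.cancer.2 = c(1.838179, 43.82388, 48.048856, 35.165157),
  stringsAsFactors = FALSE
)
Database <- data.frame(
  ENS_ID = c("ENSG00000221263", "ENSG00000207941", "ENSG00000207800",
             "ENSG00000222831", "ENSG00000207582", "ENSG00000199153",
             "ENSG00000215998", "ENSG00000207804"),
  Gene_ID = c("hsa-mir-548p", "hsa-miR-616", "hsa-mir-504", "hsa-mir-1537",
              "hsa-mir-30b", "hsa-miR-338-5p", "hsa-mir-935", "hsa-mir-599"),
  stringsAsFactors = FALSE
)

# Lookup key: the Gene_ID without the literal "*" (fixed = TRUE avoids regex escaping)
key <- sub("*", "", mydata1$Gene_ID, fixed = TRUE)

# Position of each key in Database$Gene_ID; NA where there is no match
idx <- match(key, Database$Gene_ID)

# Original Gene_ID first, then the looked-up ENS_ID, then the remaining columns
result <- data.frame(
  Gene_ID = mydata1$Gene_ID,
  ENS_ID = Database$ENS_ID[idx],
  mydata1[-1],
  stringsAsFactors = FALSE
)
print(result)
```

Unmatched genes get `NA` in ENS_ID, as with `all.x=TRUE`. The comparison is case-sensitive, so "hsa-miR-616" would not match an entry spelled "hsa-mir-616".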

score -1 · Answer 2 · answered Nov 05 '13 at 22:56

Assuming class(Database) and class(mydata) are both matrix, and all columns are of character class,

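temp <- rep(NA_character_, nrow(mydata))  # one ENS ID slot per row of mydata

for (i in seq_len(nrow(mydata))) {
  # drop the trailing "*" so that "hsa-miR-616*" can match "hsa-miR-616"
  gene <- sub("*", "", mydata[i, 1], fixed = TRUE)
  ind <- which(Database[, 2] == gene)  # rows of Database with this Gene ID

  if (length(ind) != 0) {
    temp[i] <- Database[ind[1], 1]  # keep the first matching ENS ID
  }
}
# Gene ID, ENS ID, then all remaining columns of mydata
cbind(mydata[, 1], temp, mydata[, -1, drop = FALSE])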

This will provide what you are looking for; genes without a match get NA in the new column.
