1

I have two data frames: A with 189 rows and 79 columns, and B with 354 rows and 2 columns. Some of their row names match approximately like below:

A:
Sample       value1     value2    value 3  
10003          a          b        d
10003_Qi1      a          a        c
10003_Qi2      b          a        c
10017          b          g        c
10018          b          f        s
10025_Qi       o          w        c
10040_Qi1      x          y        o
10040_ArT1     e          g        g
10125          p          g        m
10140_Ar1      w          n        c
10225          z          c        p

B:
Sample      first
10003       4
10004       8
10018       45
10025       85
10032       7
10040       54
10140       2
10132       8
10200       65
10324       9
10400       32

I want to merge the two data frames based on an inexact match of their row names, which would result in this:

Sample     value1       value2    value 3    first
10003          a          b        d            4   
10018          b          f        s            45
10025_Qi       o          w        c            85
10040_Qi1      x          y        o            54
10140_Ar1      w          n        c            2

The rule is:

  1. If they match exactly, otherwise

  2. If they match on the first five numbers, with

    a. _Qi1 and _Qi2 in A, the sample in B gets the sample value in A with _Qi1

    b. _Qi1 and _ArT1 in A, the sample in B gets the sample value in A with _Qi1 and merge

Alex A.
  • 5,466
  • 4
  • 26
  • 56
Lili
  • 75
  • 7
  • And you want to do that with which programming language (would guess r but please add tag for clarity). Also, what have you tried so far to achieve this yourself? – TZHX Apr 22 '15 at 08:33
  • I have tried agrep in loop but it did not work. – Lili Apr 22 '15 at 08:37

1 Answers1

1

Here my cumbersome solution:

A<-data.frame(matrix(c(10003,"10003_q1","10007_q1",10008,1,2,3,2,4,3,1,2),4,3))
colnames(A)<-c("sample","value1","value2")

#     sample value1 value2
# 1    10003      1      4
# 2 10003_q1      2      3
# 3 10007_q1      3      1
# 4    10008      2      2

B<-data.frame(matrix(c(10003,10004,10007,10009,4,8,45,85),4,2))
colnames(B)<-c("sample","first")

#   sample first
# 1  10003     4
# 2  10004     8
# 3  10007    45
# 4  10009    85

# step 1: adapt both dataframes
A$first<-NA
A$sample2<-strtrim(A$sample,5)
B$sample<-as.factor(B$sample)

# step 2: work down table A merging values from table B
# note: this assumes that B$sample is unqiue

for(i in 1:NROW(A)){
  ind<-A$sample2[i]==B$sample
  if(sum(ind)!=0){ # makes sure a value was found
  A[i,"first"]<-B$first[ind]
  }
}

# step 3: remove any duplicates of A$sample2
# note: this assumes that the 5 digit number will always come before the number+extension

A<-A[!duplicated(A$sample2),]

#      sample value1 value2 first sample2
# 1     10003      1      4     4   10003
# 3  10007_q1      3      1    45   10007
# 4     10008      2      2    NA   10008
steinbock
  • 726
  • 1
  • 12
  • 25
  • but there is a point: in rule 2b we need the sample value with extension which starts with Q, but here those are chosen with A. – Lili Apr 22 '15 at 11:13