Data.frame Merge and Selection of values which are common in 2 Data.frames

Question

I have 3 data.frames

> head(ON1)
  Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1     11302      NM_007377      Aatk    137.48
2     11303      NM_013454     Abca1    118.09
3     11305      NM_007379     Abca2     93.56
4     11306      NM_009592     Abcb7     92.42
5     11308      NM_007380      Abi1    410.73
6     11356      NM_009598      Abl5    149.46

 > head(ON2)
  Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1     11303      NM_013454     Abca1     86.02
2     11305      NM_007379     Abca2    103.45
3     11306      NM_009592     Abcb7     95.32
4     11308      NM_007380      Abi1    313.85
5     11350      NM_009594      Abl1    116.24
6     11352      NM_009595      Abl2    155.76

> head(ON3)
  Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1     11303      NM_013454     Abca1     69.49
2     11305      NM_007379     Abca2     82.02
3     11306      NM_009592     Abcb7     83.16
4     11308      NM_007380      Abi1    306.44
5     11350      NM_009594      Abl1    150.37
6     11355      NM_009599      Abl4    154.93

Some rows are unique to only 1 data.frame (e.g. row1 of ON1), some are common in 2 data.frames (e.g. row5 of ON2 and ON3, this row doesn't exist in ON1) and some are common in all data.frames (e.g. row2 of ON1 which is row1 in ON2 and ON3). The only difference is the values in last column Tag.Count

I want to merge all 3 data.frames in such a way that I obtain only those rows in my final data.frame which are common among at-least 2 data.frames and the value of Tag.Count which is higher among them will be assigned to that row.

> head(F)
  Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1     11303      NM_013454     Abca1     118.09
2     11305      NM_007379     Abca2     103.45
3     11306      NM_009592     Abcb7     95.32
4     11308      NM_007380      Abi1    410.73
5     11350      NM_009594      Abl1    150.37

Here you see that rows with Entrez.ID = 11302 is removed as it only appeard once among all the data.frames and those rows which were common in at least 2 data.frames are appearing here but the Tag.Count score which was maximum among all the data.frames is assigned to that row.

UPDATE

How to take average of rows. After merging above three datasets, instead of just keeping one row which has maximum Tag.count value, I want to sum the Tag.count value and divide by total number of rows having the same Entrez.ID. In fact the values in the first 3 columns are the same, difference only arises in last column. Example:

> head(d)
      Entrez.ID Nearest.Refseq Gene.Name Tag.Count
1         11302      NM_007377      Aatk    137.48
2         11303      NM_013454     Abca1    118.09
7886      11303      NM_013454     Abca1     86.02
15407     11303      NM_013454     Abca1     69.49
3         11305      NM_007379     Abca2     93.56
7887      11305      NM_007379     Abca2    103.45

So, in this case as 3 rows have Entrez.ID = 11303, the Tag.count values will be summed up (118.09 + 86.02 + 69.49) and divided by 3 and final output will contain only 1 row having Entrez.ID 11303 and Tag.Count value = Sum/no.of Rows

score 3 · Accepted Answer · answered Feb 18 '16 at 16:33

3

Here is a way with combining the three data frames. After combining all three, we find the values that appear more than once. With that index we can aggregate the data frame with the function max:

d <- do.call(rbind, list(ON1, ON2, ON3))
d1 <- do.call(paste, d[1:3])
tbl <- table(d1) > 1L
indx <- d1 %in% names(tbl[tbl])
aggregate(Tag.Count ~., d[indx,], FUN=max)
#   Entrez.ID Nearest.Refseq Gene.Name Tag.Count
# 1     11303      NM_013454     Abca1    118.09
# 2     11305      NM_007379     Abca2    103.45
# 3     11306      NM_009592     Abcb7     95.32
# 4     11308      NM_007380      Abi1    410.73
# 5     11350      NM_009594      Abl1    150.37

answered Feb 18 '16 at 16:33

Pierre L

28,203
6
47
69

Thanks @Pierre It did the job – user3253470 Feb 18 '16 at 16:41
@Pierre I have an update to the question, can you help me solving this problem. Thanks. – Newbie Apr 07 '16 at 08:01
@Newbie, regarding your update, have you tried replacing `FUN=max` by `FUN=mean`?... – Cath Apr 07 '16 at 13:14
@Newbie use `aggregate(Tag.Count~., d, mean, na.rm=TRUE)` – Pierre L Apr 07 '16 at 15:05

score 1 · Answer 2 · answered Feb 18 '16 at 16:30

1

You can do this in dplyr by binding the three together, filtering out groups with one element, and then picking the top Tag.Count in each group.

library(dplyr)

F <- bind_rows(ON1, ON2, ON3) %>%
  group_by(Entrez.ID) %>%    # elements are in same group if same Entrez.ID
  filter(n() > 1) %>%        # filter out groups with 1 element
  top_n(1, Tag.Count)        # pick highest Tag.Count from each

answered Feb 18 '16 at 16:30

David Robinson

77,383
16
167
187

Thanks @DavidRobinson – user3253470 Feb 18 '16 at 16:42
@David I have an update to the question, can you help me solving this problem. Thanks. – Newbie Apr 07 '16 at 08:02

m-dz · Answer 3 · 2016-04-08T12:19:21.497

0

Regarding the update:

preliminaries

tab <- structure(
  list(
    Entrez.ID = c(11302L, 11303L, 11303L, 11303L, 11305L, 11305L),
    Nearest.Refseq = structure(c(1L, 3L, 3L, 3L, 2L, 2L),
                               .Label = c("NM_007377", "NM_007379", "NM_013454"),
                               class = "factor"),
    Gene.Name = structure(c(1L, 2L, 2L, 2L, 3L, 3L),
                          .Label = c("Aatk", "Abca1", "Abca2"),
                          class = "factor"),
    Tag.Count = c(137.48, 118.09, 86.02, 69.49, 93.56, 103.45)
  ),
  .Names = c("Entrez.ID", "Nearest.Refseq", "Gene.Name", "Tag.Count"),
  class = "data.frame",
  row.names = c("1", "2", "7886", "15407", "3", "7887")
)
print(tab)
#       Entrez.ID Nearest.Refseq Gene.Name Tag.Count
# 1         11302      NM_007377      Aatk    137.48
# 2         11303      NM_013454     Abca1    118.09
# 7886      11303      NM_013454     Abca1     86.02
# 15407     11303      NM_013454     Abca1     69.49
# 3         11305      NM_007379     Abca2     93.56
# 7887      11305      NM_007379     Abca2    103.45

`dplyr` way

library(dplyr)
(res <- tab %>% group_by(Entrez.ID) %>% filter(n() > 1) %>% summarise(Means = mean(Tag.Count)))

Result:

  Entrez.ID   Means
      (int)   (dbl)
1     11302 137.480
2     11303  91.200
3     11305  98.505

Pure `data.table` way, after David Arenburg comment

library(data.table)
(res <- setDT(tab)[, if(.N > 1) { .(Means = mean(Tag.Count)) }, by = Entrez.ID])

Result same as above.

Leave `Nearest.Refseq` and `Gene.Name` columns

Two possible dplyr solutions, pick one:

Assuming Nearest.Refseq and Gene.Name are unique for every Gene.Name:

res <- tab %>% group_by(Entrez.ID) %>% summarise(Nearest.Refseq = Nearest.Refseq[1], Gene.Name = Gene.Name[1], Means = mean(Tag.Count))

If not you need to do something with them (replace someFunction()!):

res <- tab %>% group_by(Entrez.ID) %>% summarise(Nearest.Refseq = someFunction(Nearest.Refseq), Gene.Name = someFunction(Gene.Name), Means = mean(Tag.Count))

Edit: Mixed data.table/dplyr approach removed.

edited Apr 08 '16 at 12:19

answered Apr 07 '16 at 10:00

m-dz

2,342
17
29

This is not the "*data.table way*". `data.table` way would be just `setDT(tab)[, .(Means = mean(Tag.Count)), by = Entrez.ID]`. I've never seen a single pipe in any `data.table` tutorial. – David Arenburg Apr 07 '16 at 10:05
@M.D Thanks for answering. I also want the middle 2 columns in the final output. – Newbie Apr 07 '16 at 10:08
@DavidArenburg, you are right, lets then call it "mixed approach", where crucial parts are done with `data.table`, but pipe from `dplyr` makes it a bit more readable. I will add your pure `data.table` way to the answer. – m-dz Apr 07 '16 at 10:08
1

@Newbie There is a whole thread dedicated to that topic. Have you seen [this](http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame) ? – David Arenburg Apr 07 '16 at 10:10
@DavidArenburg, I am not sure if the user referencing works in answers, but I have added yours comment to it. – m-dz Apr 07 '16 at 10:13
@M.D I have seen the link you provided but I am unable to understand how I can add all of the columns in the final output. I can do it other way by merging but I think there will be a simpler way than this. – Newbie Apr 07 '16 at 10:20
@M.D One thing I noticed now is that one condition which I posted in my original question is not fulfilled by your answer, that was, only those rows should be retained that appears in at least 2 data sets (as the input data.frame to your code is a merge of 3 data.frames). So, in simple words in this case as the first row having `Entrez.ID = 11302`should be discarded as it appeared only in one dataset. – user3253470 Apr 08 '16 at 11:29
@M.D I got it, I just had to add the filter: `filter(n() > 1)` – user3253470 Apr 08 '16 at 11:49
@user3253470, yeah, `filter(n() > 1)` in `dplyr` and `if(.N > 1)` for `data.table`, I've edited both approaches. – m-dz Apr 08 '16 at 12:20

Data.frame Merge and Selection of values which are common in 2 Data.frames

3 Answers3

preliminaries

`dplyr` way

Pure `data.table` way, after David Arenburg comment

Leave `Nearest.Refseq` and `Gene.Name` columns

Linked

Data.frame Merge and Selection of values which are common in 2 Data.frames

3 Answers3

preliminaries

dplyr way

Pure data.table way, after David Arenburg comment

Leave Nearest.Refseq and Gene.Name columns

Linked

`dplyr` way

Pure `data.table` way, after David Arenburg comment

Leave `Nearest.Refseq` and `Gene.Name` columns