Merging replicate scores but mark the differences

Question

This is what I have:

df <- structure(list(
    Sample = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
        .Label = c("19-0001", "19-0002", "19-0003", "19-0004"), class = "factor"),
    Replicate = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
    X24854000 = structure(c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L),
        .Label = c("", "CC"), class = "factor"),
    X24854056 = structure(c(3L, 3L, 2L, 1L, 1L, 1L, 1L, 1L),
        .Label = c("", "AA", "GG"), class = "factor"),
    X24854764 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
        .Label = "TA", class = "factor"),
    X24854903 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L),
        .Label = c("", "CT"), class = "factor"),
    X24855066 = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 2L, 2L),
        .Label = c("", "CA", "CC"), class = "factor"),
    X24855114 = structure(c(2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L),
        .Label = c("", "GA", "GG"), class = "factor"),
    X24855316 = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L),
        .Label = c("", "TC"), class = "factor"),
    X24855449 = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
        .Label = c("CC", "GG"), class = "factor"),
    X24855925 = structure(c(2L, 1L, 1L, 3L, 2L, 2L, 1L, 1L),
        .Label = c("", "GA", "GG"), class = "factor"),
    X24856070 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
        .Label = c("CC", "CT"), class = "factor"),
    X24856086 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
        .Label = c("CC", "CT"), class = "factor"),
    X24856329 = structure(c(2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
        .Label = c("", "AG"), class = "factor"),
    X24856389 = structure(c(2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
        .Label = c("", "GG"), class = "factor"),
    X24857235 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L),
        .Label = c("", "CT"), class = "factor"),
    X24857350 = structure(c(3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L),
        .Label = c("", "GA", "GG"), class = "factor"),
    X24857404 = structure(c(1L, 3L, 1L, 1L, 2L, 2L, 1L, 1L),
        .Label = c("", "AT", "TT"), class = "factor")
), class = "data.frame", row.names = c(NA, -8L))

This generates this table

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1                     GG         TA                               GA         TC         CC         GA         CT         CT         AG         GG                    GG
19-0001  2          CC         GG         TA                                          TC         GG                    CC         CC                                          GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG                    CC         CT         AG
19-0002  2                                TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0003  2          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT
19-0004  2                                TA                    CA         GA                    CC                    CC         CT         AG         GG

This is what I want:

Sample   Replicate  X24854000  X24854056  X24854764  X24854903  X24855066  X24855114  X24855316  X24855449  X24855925  X24856070  X24856086  X24856329  X24856389  X24857235  X24857350  X24857404
19-0001  1          CC         GG         TA                               GA         TC         99         GA         99         99         AG         GG                    GG         TT
19-0002  1          CC         AA         TA                    CC         GG                    GG         GG         CC         CT         AG
19-0003  1          CC                    TA         CT         CA         GA         TC         CC         GA         CC         CT         AG         GG         CT         GA         AT
19-0004  1                                TA                    CA         GA         TC         CC                    CC         CT         AG         GG         CT

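I want to merge replicates 1 and 2 into a single row per sample. Where one replicate is missing a score, or both replicates have the same score, the available score should be kept. Where the two replicates disagree, the score should be replaced by "99" so that it can be removed later.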

I tried:

data_merge <- data %>%
    group_by(Sample) %>%
    summarise_all(ifelse(statement), (if_true), (if_false))

This is only a subset of the data; the real data have 44 of these X columns.

Please provide sample data in a reproducible format, e.g. using `dput`. – Maurits Evers, Sep 18 '19 at 04:06
I am not familiar with `dput`. I tried `dput(out, file = "test.txt", control = c("keepNA", "keepInteger"))`, but the output file doesn't look the same as the input one. – R Sun, Sep 18 '19 at 04:25
The use of `dput` is explained in a post on how to provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). In short, do `dput(df)` (where `df` is your `data.frame`), and then include (i.e. copy&paste) the output of `dput` in your main post (not as a comment). – Maurits Evers, Sep 18 '19 at 04:33
Thanks. The link is actually very useful compared to the instructions of the package itself. I will use that next time I have an R problem. – R Sun, Sep 19 '19 at 02:38
Glad it was helpful @RSun. Please consider closing the question by setting the green check mark next to the answer. That way you help keep SO tidy and make it easier for future SO readers to identify relevant questions. Thanks. – Maurits Evers, Sep 19 '19 at 02:42
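
As a minimal sketch of the `dput` workflow described in these comments: the data frame `small` below is a hypothetical two-row stand-in for the real data, and `pasted` and `restored` are illustrative names only. `dput()` prints a `structure(...)` or similar call as text, and evaluating that text rebuilds the object, which is what makes the pasted output reproducible for other readers.

```r
# Small stand-in data frame (hypothetical subset of the question's data)
small <- data.frame(
    Sample    = c("19-0001", "19-0001"),
    Replicate = 1:2,
    X24854000 = c("", "CC"),
    stringsAsFactors = FALSE
)

# Print R code that recreates `small`; this text is what goes into the post
dput(small)

# A reader rebuilds the object by evaluating the pasted text
pasted   <- paste(capture.output(dput(small)), collapse = "\n")
restored <- eval(parse(text = pasted))

# Compare the rebuilt object with the original
all.equal(small, restored)
```

Writing to a file with `dput(x, file = ...)` stores the same text, so the file is not expected to look like the original input table.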

Maurits Evers · Answer 1 · 2019-09-18T04:48:40.400

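Here is an option:

library(dplyr)

df %>%
    # convert factors to character so that scores can be compared and "99" returned
    mutate_if(is.factor, as.character) %>%
    group_by(Sample) %>%
    # per marker column: keep the score if the non-empty replicates agree on exactly
    # one value; otherwise (mismatch, or no score at all) return "99"
    summarise_at(
        vars(starts_with("X")),
        ~if_else(length(unique(.x[.x != ""])) == 1, first(.x[.x != ""]), "99"))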
## A tibble: 4 x 17
#  Sample X24854000 X24854056 X24854764 X24854903 X24855066 X24855114 X24855316
#  <chr>  <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>
#1 19-00… CC        GG        TA        99        99        GA        TC
#2 19-00… CC        AA        TA        99        CC        GG        99
#3 19-00… CC        99        TA        CT        CA        GA        TC
#4 19-00… 99        99        TA        99        CA        GA        TC
## … with 9 more variables: X24855449 <chr>, X24855925 <chr>, X24856070 <chr>,
##   X24856086 <chr>, X24856329 <chr>, X24856389 <chr>, X24857235 <chr>,
##   X24857350 <chr>, X24857404 <chr>
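
To see what the lambda does for a single sample and marker, the same rule can be applied to plain character vectors. This is a minimal base R sketch; `merge_pair` is a hypothetical helper name, not a dplyr function, and it is meant to mirror the rule above: exactly one distinct non-empty score is kept, anything else becomes "99".

```r
# Hypothetical helper mirroring the lambda used in summarise_at() above
merge_pair <- function(x) {
    scores <- unique(x[x != ""])      # distinct non-empty scores
    if (length(scores) == 1) scores else "99"
}

merge_pair(c("GG", "GG"))   # replicates agree        -> "GG"
merge_pair(c("", "CC"))     # one replicate missing   -> "CC"
merge_pair(c("CC", "GG"))   # replicates disagree     -> "99"
merge_pair(c("", ""))       # both replicates missing -> "99"
```

The last case is why cells that are empty in both replicates also show up as 99 in the output above.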

Sample data

df <- read.table(text =
    "Sample  Replicate   X24854000   X24854056   X24854764   X24854903   X24855066   X24855114   X24855316   X24855449   X24855925   X24856070   X24856086   X24856329   X24856389   X24857235   X24857350   X24857404
19-0001 1   ''  GG  TA  ''  ''  GA  TC  CC  GA  CT  CT  AG  GG  ''  GG  ''
19-0001 2   CC  GG  TA  ''  ''  ''  TC  GG  ''  CC  CC  ''  ''  ''  GG  TT
19-0002 1   CC  AA  TA  ''  CC  GG  ''  GG  ''  CC  CT  AG  ''  ''  ''  ''
19-0002 2   ''  ''  TA  ''  CC  GG  ''  GG  GG  CC  CT  AG  ''  ''  ''  ''
19-0003 1   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0003 2   CC  ''  TA  CT  CA  GA  TC  CC  GA  CC  CT  AG  GG  CT  GA  AT
19-0004 1   ''  ''  TA  ''  CA  GA  TC  CC  ''  CC  CT  AG  GG  CT  ''  ''
19-0004 2   ''  ''  TA  ''  CA  GA  ''  CC  ''  CC  CT  AG  GG  ''  ''  ''", header = TRUE)

edited Sep 18 '19 at 04:48

answered Sep 18 '19 at 04:20

Hi Maurits, thanks so much for the code and re-creating the sample data. I am sorry to be a pain but I have this error message 'Error: No tidyselect variables were registered Call `rlang::last_error()` to see a backtrace'. I used your re-created sample data and I still have the same error. – R Sun Sep 19 '19 at 02:47
@RSun Hmm, it works on my end. I've tested this on `dplyr_0.8.3`. What version of `dplyr` do you have? – Maurits Evers Sep 19 '19 at 02:54
Also, for samples where both replicates have a missing score, I want to leave the cell blank or insert 0. I only want to replace replicates with mismatched scores with 99. – R Sun Sep 19 '19 at 03:01
@RSun *"Also, for samples where both replicates have a missing score, I want to leave the cell blank or insert 0."* This is becoming a more complex problem statement than your original question. Let's go step-by-step: First confirm that you can reproduce the example I give in my answer. – Maurits Evers Sep 19 '19 at 03:04
@RSun You still need to update your main post to give *reproducible* sample data (i.e. include the output of `dput`). – Maurits Evers Sep 19 '19 at 03:05
I use the same version. – R Sun Sep 19 '19 at 03:06
@RSun Then the issue with you not being able to reproduce the example I give must lie elsewhere on your side. I have double-checked, and the example I give is 100% reproducible. No error should appear. – Maurits Evers Sep 19 '19 at 03:08
I ran both of your scripts (the table and the code) and got the error. My expected table has a blank where both replicate scores are missing and 99 where there is a mismatch. Lastly, I used `dput` to recreate my data as requested. I still get the same error. – R Sun Sep 19 '19 at 03:30
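
For the follow-up raised in these comments (leave a cell blank when both replicates are missing, and use 99 only for a genuine mismatch), one possible variant is sketched below. It uses base R only, so it does not depend on a particular dplyr version. `merge_scores`, `marker_cols` and `merged` are hypothetical names, and the small data frame is a cut-down stand-in built from values in the sample data, not the full table.

```r
# Hypothetical helper: "" if there is no score, the score if all non-empty
# replicates agree, "99" if they disagree
merge_scores <- function(x) {
    scores <- unique(x[!is.na(x) & x != ""])
    if (length(scores) == 0) "" else if (length(scores) == 1) scores else "99"
}

# Cut-down stand-in for the sample data (two samples, three markers)
df <- data.frame(
    Sample    = c("19-0001", "19-0001", "19-0004", "19-0004"),
    Replicate = c(1L, 2L, 1L, 2L),
    X24854000 = c("", "CC", "", ""),
    X24855449 = c("CC", "GG", "CC", "CC"),
    X24855316 = c("TC", "TC", "TC", ""),
    stringsAsFactors = FALSE
)

# All marker columns start with "X"
marker_cols <- grep("^X", names(df), value = TRUE)

# One row per sample; each marker column is collapsed with merge_scores()
merged <- aggregate(df[marker_cols], by = list(Sample = df$Sample),
                    FUN = merge_scores)
merged
```

With dplyr, the same helper could presumably be passed to `summarise_at(vars(starts_with("X")), merge_scores)` in place of the lambda in the answer, after the factor columns have been converted to character.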
