1

I have next data table dataframe

library(dplyr)
library(data.table)

my_data = data.frame(
  id = c(1, 1, 2, 2, 3),
  sample_number = c('d1', 'rr1', 'd2', 'rr2', 'd3'),
  res_1 = c('AA', NA, NA, 'GG', 'AG'),
  res_2 = c(NA, 'TT', 'CC', NA, 'TC'),
  res_3 = c('II', 'II', 'DD', 'ID', 'ID')
)
my_data <- my_data %>% as.data.table() ## convert to data table
> my_data
  id sample_number res_1 res_2 res_3
1  1            d1    AA   <NA>   II
2  1           rr1   <NA>   TT    II
3  2            d2   <NA>   CC    DD
4  2           rr2    GG   <NA>   ID
5  3            d3    AG    TC    ID

Uniq column is id. For some id exists 2 rows with different values in sample_number column. How can I unite rows by id column? For id 2 in column res_3 mistake exists. In that case result of unite will by '---'. Result is next

id  sample_number   res_1 res_2   res_3
1   d1, rr1         AA    TT      II
2   d2, rr2         GG    CC      '---'
3   d3              AG    TC      ID
autumnrustle
  • 595
  • 1
  • 10
  • 21
  • @MauritsEvers but result of you code example is `sample_number 1 d1, rr1, d2, rr2, d3` – autumnrustle Apr 05 '22 at 07:49
  • Ok, I see what you mean. This is a bit more tricky than I thought. We are missing some additional information. For example, what are the rules for generating an entry `'---'`? Does this happen when you have different entries in any `res_*` column for the same `id`? – Maurits Evers Apr 05 '22 at 09:59
  • @MauritsEvers '---' this is a mark for mistake in our data. `sample_number` - this is different dna samples persons. Each person - `id`. Sometimes we have 2 dna samples for 1 person (`d1` and `rr1` - dna samples of person with `id=1`). DNA sampels handled by different researchers. Sometimes researchers make extra work - `res_3` for `id=1` - the same result (II), all fine. But sometimes we have mistakes - `res_3` and `id=2`. We cannot combine the results of the study because they don't match. And need to mark it, for example as `---` – autumnrustle Apr 05 '22 at 12:46
  • @MauritsEvers `Does this happen when you have different entries in any res_* column for the same id?` It hapens only we have in one column `res_*` for the same `id` different results. `res_3 and id=1 => II and II => no mistake`, `res_3 and id=2 => DD and ID => mistake => ---`, – autumnrustle Apr 05 '22 at 12:52
  • `The "NaN" strings (which are not true NAs) make it awkward.` - if I open dataframe I see NA italic – autumnrustle Apr 05 '22 at 12:54
  • replace NaN with NA – autumnrustle Apr 05 '22 at 13:12

2 Answers2

1

Here is an option

# Define custom function to collapse entries from columns `res_*`
collapse <- function(x) {
    if (length(unique(x[!is.na(x)])) == 1) unique(x[!is.na(x)]) else "----"
}
library(tidyverse)
my_data %>%
    group_by(id) %>%
    summarise(
        sample_number = toString(sample_number),
        across(starts_with("res"), collapse),
        .groups = "drop")
## A tibble: 3 x 5
#     id sample_number res_1 res_2 res_3
#  <dbl> <chr>         <chr> <chr> <chr>
#1     1 d1, rr1       AA    TT    II   
#2     2 d2, rr2       GG    CC    ---- 
#3     3 d3            AG    TC    ID   

Note that I assume that the NAs in your data.frame are real NAs, as in

my_data = data.frame(
    id = c(1, 1, 2, 2, 3),
    sample_number = c('d1', 'rr1', 'd2', 'rr2', 'd3'),
    res_1 = c('AA', NA, NA, 'GG', 'AG'),
    res_2 = c(NA, 'TT', 'CC', NA, 'TC'),
    res_3 = c('II', 'II', 'DD', 'ID', 'ID')
)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
1

a data.table approach

my_data[, sample_number := paste0(sample_number, collapse = ", "), by = .(id)]
DT <- melt(my_data, id.vars = c("id", "sample_number"), na.rm = TRUE)
dcast(DT, id + sample_number ~ variable, value.var = "value", 
      fun.aggregate = function(x) ifelse(length(unique(x)) > 1, "---", x))
#    id sample_number res_1 res_2 res_3
# 1:  1       d1, rr1    AA    TT    II
# 2:  2       d2, rr2    GG    CC   ---
# 3:  3            d3    AG    TC    ID
Wimpel
  • 26,031
  • 1
  • 20
  • 37