Merging columns and adding NA where there is no value

Question

I would like to merge 19 columns with different length from different dataframes and compare them. This is an example:

df1: 
PA0001
PA0002
PA0003
PA0004
PA0005

df2: 
PA0001
PA0003
PA0006
PA0007


df3: 
PA0001
PA0007

etc...

And the output is something like that:

PA0001  | PA0001 | PA0001
PA0002  |  NA    | NA
PA0003  | PA0003 | NA                   
PA0004  | NA     | NA
PA0005  | NA     | NA
NA      | PA0006 | NA
NA      | PA0007 | PA0007

I used compare or merge function but I didn't have a good result. I have tried to used the function of this question: Link

But I got this error:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [5254] must be the same length as the vector [2]

Here you are and example:

test1 <- data.frame(c("PA0001","PA0002","PA0003","PA0004","PA0005","PA0006"))
test2 <- data.frame(c("PA0001","PA0002","PA0004","PA0005","PA0007"))
test3 <- data.frame(c("PA0001","PA0004","PA0005","PA0007", "PA0008"))

Thank you very much.

Please read the answer to the question you read. ... `stringsAsFactors = FALSE`. You have to coerce your columns to character. — jogo, Feb 16 '18 at 11:48
@jogo I did it and I got this error `Error in attributes(.Data) <- c(attributes(.Data), attrib) : 'names' attribute [5254] must be the same length as the vector [2]` In using just one column and with different names. — Marcus, Feb 16 '18 at 11:51
For me the answer https://stackoverflow.com/a/43266762/5414452 works with `df1 = data.frame(TEST1=c("PA0001","PA0002","PA0003","PA0004","PA0005","PA0006"), stringsAsFactors = FALSE) ` and so on — jogo, Feb 16 '18 at 12:02
The example you created and the input/output you showed are different — akrun, Feb 16 '18 at 12:09

akrun · Answer 1 · 2018-02-16T12:26:59.887

If we need the output as in the OP's expected, place the datasets in a list, rbind the list elements while creating a 'grp' column with rbindlist, then dcast from 'long' to 'wide' while creating a sequence column in the formula by matching the 'id' with the unique elements of 'id'

library(data.table)
dcast(rbindlist(list(test1, test2, test3), idcol = 'grp'),
         match(id, unique(id)) ~ paste0("col", grp))[, id := NULL][]
#     col1   col2   col3
#1: PA0001 PA0001 PA0001
#2: PA0002     NA     NA
#3: PA0003 PA0003     NA
#4: PA0004     NA     NA
#5: PA0005     NA     NA
#6:     NA PA0006     NA
#7:     NA PA0007 PA0007

Or as @jogo split the code to make it more clear, in the first step, rbind all the datasets in the list while creating the 'grp' column by specifying the idcol argument

t_all <- rbindlist(list(test1, test2, test3), idcol='grp');

Then dcast, to 'wide' format and assign the 'id' column to NULL

dcast(t_all, id ~ grp, value.var='id')[, id := NULL][]

data

test1 <- data.frame(id = c("PA0001","PA0002","PA0003","PA0004","PA0005"))
test2 <- data.frame(id = c("PA0001","PA0003","PA0006","PA0007"))
test3 <- data.frame(id = c("PA0001", "PA0007"))

Roman · Answer 2 · 2018-02-16T12:10:16.690

You can try a tidyverse solution

library(tidyverse)
d1 <- read.table(text="PA0001
           PA0002
           PA0003
           PA0004
           PA0005")
d2 <- read.table(text="PA0001
           PA0003
           PA0006
           PA0007")
d3 <- read.table(text="PA0001
           PA0007")

list(d1, d2, d3) %>% 
  bind_rows(.id = "df") %>% 
  mutate(n = TRUE) %>% 
  spread(df, n, fill = FALSE)
      V1     1     2     3
1 PA0001  TRUE  TRUE  TRUE
2 PA0002  TRUE FALSE FALSE
3 PA0003  TRUE  TRUE FALSE
4 PA0004  TRUE FALSE FALSE
5 PA0005  TRUE FALSE FALSE
6 PA0006 FALSE  TRUE FALSE
7 PA0007 FALSE  TRUE  TRUE

The idea is to include all data.frames in a list, then bind them rowwise, add a logical TRUE and use tidyr's spread function to obtain the result. Of course you can get your wxpected output as well using:

list(d1, d2, d3) %>% 
  bind_rows(.id="df") %>% 
  mutate(n=V1) %>% 
  spread(df, n) %>% 
  select(-1)
       1      2      3
1 PA0001 PA0001 PA0001
2 PA0002   <NA>   <NA>
3 PA0003 PA0003   <NA>
4 PA0004   <NA>   <NA>
5 PA0005   <NA>   <NA>
6   <NA> PA0006   <NA>
7   <NA> PA0007 PA000

In base R you can try:

Reduce(function(x, y) merge(x, y, by="V1", all.x = TRUE, all.y = TRUE),
              lapply(list(d1, d2, d3), function(x) cbind(x,V2=x$V1)))[,-1]
    V2.x   V2.y     V2
1 PA0001 PA0001 PA0001
2 PA0002   <NA>   <NA>
3 PA0003 PA0003   <NA>
4 PA0004   <NA>   <NA>
5 PA0005   <NA>   <NA>
6   <NA> PA0006   <NA>
7   <NA> PA0007 PA0007

score 2 · Accepted Answer · answered Feb 16 '18 at 12:14

Here is a base R solution:

x <- lapply(list(test1, test2, test3), function(x) as.character(x[,1]))
xuni <- unique(unlist(x))
one_set <- function(i) { 
    idx <- which(is.na(match(xuni, x[[i]])))
    ans <- xuni
    ans[idx] <- NA
    return(ans)
}
res <- data.frame(
    test1=one_set(1),
    test2=one_set(2),
    test3=one_set(3),
    stringsAsFactors=FALSE
)
res

   test1  test2  test3
1 PA0001 PA0001 PA0001
2 PA0002 PA0002   <NA>
3 PA0003   <NA>   <NA>
4 PA0004 PA0004 PA0004
5 PA0005 PA0005 PA0005
6 PA0006   <NA>   <NA>
7   <NA> PA0007 PA0007
8   <NA>   <NA> PA0008

The first line converts the factors to character and transforms the data.frame to a list x of elements in each data.frame.

The second line identifies all observed values.

The function one_set takes one index (which element of x to look at) and returns the desired output column.

Merging columns and adding NA where there is no value

3 Answers3

data