0

I have two data sets and I want to add them together based on matching column names. For the data sets, each row represents a study site and each column represents a survey. Each survey is restricted to one block. I want to add the contents of each data set together based on column names.

Data set 1 (1 denotes a survey was performed):

Block A1  A2  A3  A4  A5
    1  0   1   0   0   0
    2  1   0   0   1   0
    3  0   0   1   0   1

Data set 2 is a subset of data set 1 in which the 1 now represents that the species of interest was found during the survey

Data set 2:

Block  A1  A2  A4  A5
    1   0   1   0   0
    3   0   0   0   1

The ideal out put would look something like this where 2 represents that a survey was performed and the species of interest was found, 1 represents that a survey was performed and 0 that no surveys were performed.

Data set 3:

Block A1  A2  A3  A4  A5
  1    0   2   0   0   0
  2    1   0   0   1   0
  3    0   0   1   0   2
Devin Mendez
  • 101
  • 8
  • How big is your data? – akash87 May 14 '19 at 17:13
  • I have ~3500 surveys (columns) over 220 sites (rows) for data set 1 and dataset 2 has ~130 surveys over 56 sites. – Devin Mendez May 14 '19 at 17:18
  • And you don't have every column of ever survey? – akash87 May 14 '19 at 17:18
  • Correct, in data set 2 the surveys that didn't contain the species of interest were removed. It was the only way I could get the data out of access. – Devin Mendez May 14 '19 at 17:25
  • 1
    You want the addition to be based on column *and row*, right? Is there a typo in the output? Dataset 2, row label 3 has a 1 in column 4, but it looks like that got added to the column 3 row 3 from Dataset 2... – Gregor Thomas May 14 '19 at 17:25
  • Yes, there was a typo. I was just created the data to symbolize what I wanted. It should be corrected now. Thank you for pointing that out. – Devin Mendez May 14 '19 at 17:33
  • @divibisan they are similar, but we're dealing with a case here where one dataframe doesn't have all of the columns. –  May 14 '19 at 18:13
  • 2
    @gersht `bind_rows` doesn't require both dataframes to have the same columns, so there's no difference there. In fact, your answer to this question is identical to the accepted answer on that duplicate (with the exception of using the `na.rm=T` parameter) – divibisan May 14 '19 at 18:18
  • @divibisan one dataframe above doesn't have all the columns. That's the difference. The OP also explicitly says they want to sum based on matching column names, while the other OP wants to sum duplicate rows. I think they are what Atwood would call [borderline duplicates](https://stackoverflow.blog/2009/04/29/handling-duplicate-questions/): "There’s often benefit to having multiple subtle variants of a question around, as people tend to ask and search using completely different words, and the better our coverage, the better odds our fellow programmers can find the answer they’re looking for." –  May 14 '19 at 18:39
  • 1
    @gersht That's fine, people have different opinions on duplicates – that's why we have the moderation and close vote system we have. Still, I think the fact that the code in the answer is identical to that in the duplicate is a pretty good clue that this is, in fact, a duplicate – divibisan May 14 '19 at 19:03
  • @divibisan the "opinion" in this case is SO's preferred approach. It also makes sense: The same problem can be formulated differently, and apparent duplicates can highlight different aspects of the same problem. Still, I don't entirely disagree with you. My last comment was meant to move us past the "duplicate vs. nonduplicate" discussion. It's obvious that they are, at some level, duplicates. But both also add something unique. The data in the other post have, for instance, the same number of columns with the same names, which suggests an approach that doesn't necessitate `bind_rows`. –  May 15 '19 at 07:59

3 Answers3

1

So I would approach this a bit more inefficiently than other people here:

library(tidyverse)

data1 %>% 
gather(key,value,-Block) %>% 
rbind(data2 %>% 
gather(key,value, -Block)) %>% 
group_by(Block, key) %>% 
summarise(All_vals = sum(value)) %>% 
spread(key, All_vals)

Here is the result:

 Block    A1    A2    A3    A4    A5
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     0     2     0     0     0
2     2     1     0     0     1     0
3     3     0     0     1     1     2
akash87
  • 3,876
  • 3
  • 14
  • 30
  • This is nice in that it would work just as well if `data2` has some elements missing from `data1`, not just the other way. – Gregor Thomas May 14 '19 at 17:53
  • I am relatively new to R and I don't understand what "%>%" symbolizes. – Devin Mendez May 14 '19 at 18:11
  • @DevinMendez it is the pipe operator and is part of the `tidyverse` language and it means "pass the results from the left hand side of the pipe and pass it to the following." This allows us to process data without necessarily storing intermediate steps. – akash87 May 14 '19 at 18:15
  • @akash87 thank you for explaining that to me. I received the error "Error in sum(value) : invalid 'type' (character) of argument" – Devin Mendez May 14 '19 at 18:27
  • Are you posting real data or example data? – akash87 May 14 '19 at 18:28
  • It was example data. All of the data is binary though. – Devin Mendez May 14 '19 at 18:38
  • So print out part of the data after your run the code from `data1` through the first `gather` – akash87 May 14 '19 at 18:40
1

Edit

You seem a little uncomfortable using dplyr verbs and %>%, so I've decided to add a couple other possible solutions that you might find more appealing.

Note that none of these solutions will work if one or more of your variables is non-numeric, which seems to be the case for your real data. You'll need to find out which are non-numeric, and either convert them to numeric or drop them.

Using base R

The rbind function will concatenate dataframes by matching variables if they all have the same variables. You should add missing variables to your dataframes with value NA, then rbind them. You can then call aggregate on a subset of your combined dataframes and sum by the levels of Block:

full_df <- rbind(df1, cbind(df2, A3 = NA))
aggregate(full_df[,2:6], list(Block = full_df$Block), sum, na.rm = T)

#### OUTPUT ####

  Block A1 A2 A3 A4 A5
1     1  0  2  0  0  0
2     2  1  0  0  1  0
3     3  0  0  1  0  2

Using dplyr

The dplyr function bind_rows is quite flexible, and makes concatenating dataframes a little easier by matching shared variables, and automatically filling unmatched variables with NA. Group by Block and use summarise_all to apply the function to every variable:

library(dplyr)

bind_rows(df1, df2) %>%
    group_by(Block) %>% 
    summarise_all(sum, na.rm = T)

#### OUTPUT ####

# A tibble: 3 x 6
  Block    A1    A2    A3    A4    A5
  <int> <int> <int> <int> <int> <int>
1     1     0     2     0     0     0
2     2     1     0     0     1     0
3     3     0     0     1     0     2

Using data.table

Another option would be to use data.table, which has a reputation for being fast, as well as a different syntax that some prefer.

library(data.table)

full_df <- rbindlist(list(df1, df2), fill = T)
full_df[, lapply(.SD, sum, na.rm = T), by = "Block"]

#### OUTPUT ####

   Block A1 A2 A3 A4 A5
1:     1  0  2  0  0  0
2:     2  1  0  0  1  0
3:     3  0  0  1  0  2
  • When I tried running through this code I received "Error in group_by(Blocks, add = TRUE) : object 'Blocks' not found" – Devin Mendez May 14 '19 at 18:19
  • "this code" doesn't include the argument `add = TRUE`, so it sounds like your trying something else. –  May 14 '19 at 18:25
  • Originally I had deleted the %>% which caused that error. Once I inserted those back in I received the error "Error in sum(rn, na.rm = TRUE) : invalid 'type' (character) of argument" – Devin Mendez May 14 '19 at 18:41
  • @DevinMendez if you change the code and don't show me your changes I can't help you. –  May 14 '19 at 18:44
  • I used the same exact code that you posted above, and received the error "Error in sum(rn, na.rm = TRUE) : invalid 'type' (character) of argument" – Devin Mendez May 14 '19 at 19:05
  • @DevinMendez it's because column `rn` in your dataframe is type character. Either remove it or convert it into type numeric. –  May 14 '19 at 19:27
  • @DevinMendez I've added a couple more options so you don't have to deal with the `%>%` if you don't want to. However, you will still have to convert your character variable to numeric or drop it. –  May 15 '19 at 09:50
  • 1
    I was able to get the dplyr method to work once I transposed my data. Not sure if it worked because there were less columns to deal with? Anyways thank you for your help! – Devin Mendez May 16 '19 at 14:25
0

Using base R!

matrix2 <- matrix(rep(0, length.out = nrow(dataset1)*ncol(dataset1)),
ncol = ncol(dataset1))
#then make sure the column names match dataset1 (not dataset2)
names(matrix2) <- names(dataset1)
for (i in 1:ncol(matrix2)) {
if (any(names(dataset2) == names(matrix2)[i]
matrix2[,i] <- dataset2[,which(names(dataset2) == names(matrix2)[i]]
}
}

Then add the these columns, which are in the same position/order as dataset1 (or are zero otherwise), to dataset1:

for (i in 1:ncol(dataset1) {dataset1[,i] <- dataset1[,i] + matrix2[,i]}
Dij
  • 1,318
  • 1
  • 7
  • 13