I have a data.frame in R. I need to compare rows, and where they match on every column except one, merge them into a single row and combine the values of that remaining column. This seems like a common need when working with R, so ddply or a function from some other package should be able to accomplish it. Below is the data as is, dat, and what it should look like after running some code, foo. I'm new to R, so any help is greatly appreciated.

Before:

 dat <- structure(list(detected_id = c(11, 11, 4), reviewer_name = c("mike", 
"mike", "john"), created_at = c("2016-05-04 10:02:45", "2016-05-04 10:02:45", 
"2016-05-04 10:02:45"), stage = c(2L, 2L, 1L), V7 = c("Detected Organism: Staphylococcus Aureus, Comment: Looks good", 
"Detected Organism: Staphylococcus Aureus, Comment: Note 1", 
"Detected Organism: Human Adenovirus 7, Comment: test")), .Names = c("detected_id", 
"reviewer_name", "created_at", "stage", "V7"), row.names = c(NA, 
-3L), class = "data.frame")

After:

foo <- structure(list(detected_id = c(11L, 4L), reviewer_name = c("mike", 
"john"), created_at = structure(c(1L, 1L), .Label = "5/4/16 10:02", class = "factor"), 
    stage = c(2L, 1L), V7 = structure(c(2L, 1L), .Label = c("Detected Organism: Human Adenovirus 7, Comment: test", 
    "Detected Organism: Staphylococcus Aureus, Comment: Looks good; Detected Organism: Staphylococcus Aureus, Comment: Note 1"
    ), class = "factor")), .Names = c("detected_id", "reviewer_name", 
"created_at", "stage", "V7"), row.names = c(NA, -2L), class = "data.frame")

EDIT:

The solutions below worked for the dataset I provided; however, I've found a case where they don't work as intended. Here is an example of a data.frame that fails. Note that the detected_id column is obsolete for me.

dat <- structure(list(detected_id = c(11, 11, 11, 11, 12, 4), reviewer_name = c("Mike", 
"Mike", "Mike", "Mike", "John", "John"), created_at = c("2016-05-04 10:02:45", 
"2016-05-04 10:02:45", "2016-05-04 10:02:45", "2016-05-04 10:02:45", 
"2016-05-04 10:02:45", "2016-05-04 10:02:45"), stage = c(2L, 
3L, 2L, 3L, 1L, 1L), V7 = c("Detected Organism: Staphylococcus Aureus, Comment: Looks good", 
"Detected Organism: Staphylococcus Aureus, Comment: Looks good", 
"Detected Organism: Staphylococcus Aureus, Comment: Note 1", 
"Detected Organism: Staphylococcus Aureus, Comment: Note 1", 
"Detected Organism: Stenotrophomonas Maltophilia, Comment: new note", 
"Detected Organism: Human Adenovirus 7, Comment: test")), .Names = c("detected_id", 
"reviewer_name", "created_at", "stage", "V7"), row.names = c(NA, 
-6L), class = "data.frame")

SOLUTION: remove the detected_id column before reshaping the data.frame. Thanks, @eddi.
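A minimal sketch of that fix, assuming the data.table approach from the answer below. The V7 strings are shortened stand-ins for the real comments (an assumption for readability), and the name merged is only illustrative. With detected_id dropped, the two John rows fall into the same group and are combined:

```r
library(data.table)

# Shortened stand-in for the failing data above (V7 text abbreviated)
dat <- data.frame(
  detected_id   = c(11, 11, 11, 11, 12, 4),
  reviewer_name = c("Mike", "Mike", "Mike", "Mike", "John", "John"),
  created_at    = "2016-05-04 10:02:45",
  stage         = c(2L, 3L, 2L, 3L, 1L, 1L),
  V7            = c("Looks good", "Looks good", "Note 1", "Note 1",
                    "new note", "test"),
  stringsAsFactors = FALSE
)

setDT(dat)                  # convert to a data.table by reference
dat[, detected_id := NULL]  # drop the column that should not split groups

# Collapse V7 within each remaining combination of key columns
merged <- dat[, .(V7 = paste(V7, collapse = "; ")),
              by = .(reviewer_name, created_at, stage)]
print(merged)
```

Grouping on the remaining columns leaves three rows here: Mike at stage 2, Mike at stage 3, and a single John row at stage 1.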

webDevleoper101

2 Answers

3
library(data.table)

setDT(dat)[, paste(V7, collapse = "; ")
           , by = .(detected_id, reviewer_name, created_at, stage)]
#   detected_id reviewer_name          created_at stage
#1:          11          mike 2016-05-04 10:02:45     2
#2:           4          john 2016-05-04 10:02:45     1
#                                                                                                                         V1
#1: Detected Organism: Staphylococcus Aureus, Comment: Looks good; Detected Organism: Staphylococcus Aureus, Comment: Note 1
#2:                                                                     Detected Organism: Human Adenovirus 7, Comment: test
eddi
0

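Using base R:

# Collapse V7 within each group; "; " matches the separator asked for in the question.
# The combined column is named x in the result.
with(dat, aggregate(V7,
                    list(detected_id = detected_id, reviewer_name = reviewer_name,
                         created_at = created_at, stage = stage),
                    paste, collapse = "; "))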
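For what it's worth, the same base R idea can also be written with the formula interface of aggregate, which keeps the V7 column name in the result. This is only a sketch on a shortened copy of the question's first dataset (the abbreviated V7 strings and the name merged are assumptions for illustration):

```r
# Shortened stand-in for the first dataset in the question (V7 text abbreviated)
dat <- data.frame(
  detected_id   = c(11, 11, 4),
  reviewer_name = c("mike", "mike", "john"),
  created_at    = "2016-05-04 10:02:45",
  stage         = c(2L, 2L, 1L),
  V7            = c("Looks good", "Note 1", "test"),
  stringsAsFactors = FALSE
)

# Left of ~ is the column to combine; right of ~ are the grouping columns.
# Extra arguments (collapse) are passed on to paste.
merged <- aggregate(V7 ~ detected_id + reviewer_name + created_at + stage,
                    data = dat, FUN = paste, collapse = "; ")
print(merged)
```

Note that aggregate returns the groups sorted by the grouping columns, so the row order may differ from the input.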
Ananta