How to summarize rows of a data frame into one while removing Duplicates in R?

Question

so, I have a data frame with 2 or more rows and different columns (ID, Location, Task, Skill, ...). I want to summarize these rows into (a) one row (dataframe) where different column entries should be joined together (but only if different! i.e. if for two rows the IDs are the same, the final dataframe row should show only one ID not the same twice i.e. "ID1", but if they are different, both should be shown i.e. 'ID1, ID2") and some numerical values should be added (+) together.

df = data.frame("ID" = c(PA1, PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures,  react to changes
", "PO - data based operations"), "Field" = c("PA","PA"), "Work" = c(0.5, 0.1), "Skill1" = c(CRO, CRO), "Skill2" = c(0, PPto), "ds" = c(5, 5))
print(df)

and the output should look like this

df_final = data.frame("ID" = c(PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures,  react to changes, data based operations"), "Field" = c("PA"), "Work" = c(0.6), "Skill1" = c(CRO), "Skill2" = c(PPto), "ds" = c(5))
print(df_final)

Thank you!

Please make a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) or [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with a sample input (not posted in an image, consider using `dput()`) and your expected output. — Martin Gal, Apr 17 '22 at 12:56

score 0 · Answer 1 · answered Apr 17 '22 at 14:46

Let's ignore Skill2 for now: How close is the following code to what you want to do?

df2 %>%
  group_by(ID)%>%
  summarise(work = sum(Work), 
            skill1 = unique(Skill1),
            ds = unique(ds), 
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))

You can also mutate(occupation = str_replace_all(occupation, "PO - ")) to get rid of the duplicate "PO - "'s.

You're going to run into problems if the variables like Skill1/Skill2/ds are not unique to each ID, as in they have cardinality > 1.

df2 %>%
  group_by(ID)%>%
  summarise(work = sum(Work), 
            skill1 = unique(Skill1),
            skill2 = unique(Skill2),
            ds = unique(ds), 
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))

If it's a simple data-entry issue, you could do a bit of wrangling to filter for only Skill2 entries with letters contained, and then join this frame back to your original frame.

You could also use the past0() collapse = trick, but then you'll end up with Skill2 = c(NA, "PPto"), which I'm pretty sure you don't want.

How to summarize rows of a data frame into one while removing Duplicates in R?

1 Answers1