
I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations only appear once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID and all the information in columns 2:150 is concatenated. I do not need any calculations run on these observations, just some quick munging.

I've tried:

df.new <- group_by(df,"ID")

and also:

library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]

and unfortunately neither has worked. Any help is appreciated!

HAL-9000
  • Are the observations that appear more than once the same for every column? If so, `unique(dt)` would work. – moman822 May 04 '17 at 18:43
  • No, for variables/columns 2:150, there are different values that I need to retain. When there is more than one ID, the rows themselves are not duplicates, just the ID#. – HAL-9000 May 04 '17 at 18:45
  • Just do `dt[, lapply(.SD, toString), by = ID]` – Sotos May 04 '17 at 19:01
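
For reference, here is a minimal sketch of the approach from the last comment, run on a small made-up data frame (the column names x and y are purely illustrative):

library(data.table)

# toy data: ID 1 appears twice, ID 2 once
df <- data.frame(ID = c(1, 1, 2),
                 x  = c("a", "b", "c"),
                 y  = c(10, 20, 30),
                 stringsAsFactors = FALSE)
dt <- data.table(df)

# collapse every non-ID column to one comma-separated string per ID
dt.new <- dt[, lapply(.SD, toString), by = ID]
dt.new
#    ID    x      y
# 1:  1 a, b 10, 20
# 2:  2    c     30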

3 Answers


What about this?

df %>%
  group_by(ID) %>%
  summarise_each(funs(paste0(., collapse = "/")))

Or, with a reproducible example:

iris %>%
  group_by(Species) %>%
  summarise_each(funs(paste0(., collapse = "/")))
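
A side note: summarise_each() and funs() have since been superseded in newer dplyr releases. Assuming dplyr >= 1.0 is available, a rough modern equivalent of the same idea uses across():

iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))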
boshek
  • That could possibly work. However, I'm not just dealing with one dependent variable ("Sepal.Length"): I have over 150. Is there a way to reference all of those without having to do each name individually? Perhaps a column index? – HAL-9000 May 04 '17 at 18:54
  • Like using `summarise_each`? See above - I've changed the answer. – boshek May 04 '17 at 19:09

I have had a similar problem in the past, though I wasn't dealing with this many copies of the same data: in most cases there were just 2 instances, and in some cases 3. Below was my approach. Hopefully it will help.

library(dplyr)
library(tidyr)

# flag every row whose key is duplicated; combining duplicated() with
# fromLast = TRUE also catches the first occurrence of each duplicated key
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]       # all rows with a repeated key
non_dupes <- df[!idx, ]  # all rows whose key appears only once

# roll the duplicated rows up to one row per key: fill NAs down and then up
# within each group, then keep the first row of each group
temp <- dupes %>%
  group_by(key) %>%
  fill_(colnames(dupes), .direction = "down") %>%
  fill_(colnames(dupes), .direction = "up") %>%
  slice(1)

Then it is easy to merge temp and non_dupes back together, as sketched below.
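
For example (a sketch, assuming the filled columns still line up):

# stack the collapsed duplicates back onto the rows that were already unique
df.new <- bind_rows(non_dupes, temp)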

EDIT

I would highly recommend filtering the data frame down to only the rows relevant to your end goal before running this, as the process can take some time.

Drj

Using base R:

df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                day = c("1","2","3","4","5","6","7","8"),
                year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                stringsAsFactors = F)

> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016

Do:

z = aggregate(df[,2:3], 
              by = list(id = df$ID), 
              function(x){ paste0(x, collapse = "/") }
              )

Result:

> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016
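
Applied to the original problem, the same pattern would look something like this (a sketch, assuming the ID sits in the first column and the remaining 150 or so variables follow it):

# collapse every column except ID into a "/"-separated string per ID
df.new = aggregate(df[, -1],
                   by = list(ID = df$ID),
                   function(x){ paste0(x, collapse = "/") })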

EDIT

If you want to avoid "collapsing" NA do:

z = aggregate(df[,2:3],
              by = list(id = df$ID),
              function(x){ paste0(x[!is.na(x)], collapse = "/") })

For a data frame like:

> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016

The result is:

> z
  id day           year
1  a 1/2           2016
2  b 3/4 2017/2016/2017
3  c   6           2016
4  d 7/8      2017/2016
R. Schifini
  • This looks to have done the trick! Thank you! One additional question: to omit any NAs, would it read: "na.action=na.omit", "na.action=na.pass", or "na.rm=TRUE"? I placed each of these after the last " } " in the statement, but none worked, saying they were unused arguments. – HAL-9000 May 04 '17 at 19:24
  • @Oryx added the change to avoid NA – R. Schifini May 04 '17 at 20:45