
I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations only appear once; others appear multiple times (upwards of 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID and all the information in columns 2:150 is concatenated. I do not need any calculations run on these observations, just some quick munging.

I've tried:

df.new <- group_by(df,"ID")

and also:

library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]

and unfortunately neither has worked. Any help is appreciated!

HAL-9000
  • Are the observations that appear more than once the same for every column? If so, `unique(dt)` would work. – moman822 May 04 '17 at 18:43
  • No, for variables/columns 2:150, there are different values that I need to retain. When there is more than one ID, the rows themselves are not duplicates, just the ID#. – HAL-9000 May 04 '17 at 18:45
  • Just do `dt[, lapply(.SD, toString), by = ID]` – Sotos May 04 '17 at 19:01
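
For reference, here is a minimal sketch of the approach from the last comment, run on a small made-up data frame (the column names x and y are purely illustrative):

library(data.table)

# toy data: ID 1 appears twice, ID 2 once
df <- data.frame(ID = c(1, 1, 2),
                 x  = c("a", "b", "c"),
                 y  = c(10, 20, 30),
                 stringsAsFactors = FALSE)
dt <- data.table(df)

# collapse every non-ID column to one comma-separated string per ID
dt.new <- dt[, lapply(.SD, toString), by = ID]
dt.new
#    ID    x      y
# 1:  1 a, b 10, 20
# 2:  2    c     30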

3 Answers


What about this?

df %>%
  group_by(ID) %>%
  summarise_each(funs(paste0(., collapse = "/")))

Or, with a reproducible example:

iris %>%
  group_by(Species) %>%
  summarise_each(funs(paste0(., collapse = "/")))
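
A side note: summarise_each() and funs() have since been superseded in newer dplyr releases. Assuming dplyr >= 1.0 is available, a rough modern equivalent of the same idea uses across():

iris %>%
  group_by(Species) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))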
boshek
  • That could possibly work. However, I'm not just dealing with one dependent variable ("Sepal.Length"): I have over 150. Is there a way to reference all of those without having to do each name individually? Perhaps a column index? – HAL-9000 May 04 '17 at 18:54
  • Like using `summarise_each`? See above - I've changed the answer. – boshek May 04 '17 at 19:09

I have had a similar problem in the past, though I wasn't dealing with this many copies of the same data: in most cases there were just 2 instances, and in some cases 3. Below was my approach. Hopefully it will help.

library(dplyr)
library(tidyr)

# flag every row whose key is duplicated; combining duplicated() with
# fromLast = TRUE also catches the first occurrence of each duplicated key
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]       # all rows with a repeated key
non_dupes <- df[!idx, ]  # all rows whose key appears only once

# roll the duplicated rows up to one row per key: fill NAs down and then up
# within each group, then keep the first row of each group
temp <- dupes %>%
  group_by(key) %>%
  fill_(colnames(dupes), .direction = "down") %>%
  fill_(colnames(dupes), .direction = "up") %>%
  slice(1)

Then it is easy to merge temp and non_dupes back together, as sketched below.
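
For example (a sketch, assuming the filled columns still line up):

# stack the collapsed duplicates back onto the rows that were already unique
df.new <- bind_rows(non_dupes, temp)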

EDIT

I would highly recommend filtering the data frame down to only the rows relevant to your end goal before running this, as the process can take some time.

Drj

Using base R:

df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
                day = c("1","2","3","4","5","6","7","8"),
                year = c(2016,2017,2017,2016,2017,2016,2017,2016),
                stringsAsFactors = F)

> df
  ID day year
1  a   1 2016
2  a   2 2017
3  b   3 2017
4  b   4 2016
5  b   5 2017
6  c   6 2016
7  d   7 2017
8  d   8 2016

Do:

z = aggregate(df[,2:3], 
              by = list(id = df$ID), 
              function(x){ paste0(x, collapse = "/") }
              )

Result:

> z
  id   day           year
1  a   1/2      2016/2017
2  b 3/4/5 2017/2016/2017
3  c     6           2016
4  d   7/8      2017/2016
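
Applied to the original problem, the same pattern would look something like this (a sketch, assuming the ID sits in the first column and the remaining 150 or so variables follow it):

# collapse every column except ID into a "/"-separated string per ID
df.new = aggregate(df[, -1],
                   by = list(ID = df$ID),
                   function(x){ paste0(x, collapse = "/") })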

EDIT

If you want to avoid "collapsing" NA do:

z = aggregate(df[,2:3],
              by = list(id = df$ID),
              function(x){ paste0(x[!is.na(x)], collapse = "/") })

For a data frame like:

> df
  ID  day year
1  a    1 2016
2  a    2   NA
3  b    3 2017
4  b    4 2016
5  b <NA> 2017
6  c    6 2016
7  d    7 2017
8  d    8 2016

The result is:

> z
  id day           year
1  a 1/2           2016
2  b 3/4 2017/2016/2017
3  c   6           2016
4  d 7/8      2017/2016
R. Schifini
  • This looks to have done the trick! Thank you! One additional question: to omit any NAs, would it read: "na.action=na.omit", "na.action=na.pass", or "na.rm=TRUE"? I placed each of these after the last " } " in the statement, but none worked, saying they were unused arguments. – HAL-9000 May 04 '17 at 19:24
  • @Oryx added the change to avoid NA – R. Schifini May 04 '17 at 20:45