1

So here is my challenge. I am trying to get rid of rows of data that are best organized as a column. The original data set looks like

1|1|a
2|3|b
2|5|c
1|4|d
1|2|e
10|10|f

And the end result desired is

1 |1,2,4 |a| e d
2 |3,5   |b| c
10|10    |f| NA

The table's shaping is based from minimum value Col 2 within groupings of Col 1, where new column 3 is defined from the minimum values within the group and new column 4 is collapsed from not the minimum of. Some of the approaches tried include:

newTable[min(newTable[,(1%o%2)]),] ## returns the minimum of both COL 1 and 2 only

ddply(newTable,"V1", summarize, newCol = paste(V7,collapse = " ")) ## collapses all values by Col 1 and creates a new column nicely.

Variations to combine these lines of code into a single line have not worked, in part to my limited knowledge. These modifications are not included here.

Aaron
  • 317
  • 4
  • 16

2 Answers2

4

Try:

 library(dplyr)
 library(tidyr)

 dat %>% 
     group_by(V1) %>% 
     summarise_each(funs(paste(sort(.), collapse=","))) %>%
     extract(V3, c("V3", "V4"), "(.),?(.*)")

gives the output

  #  V1    V2 V3  V4
  #1  1 1,2,4  a d,e
  #2  2   3,5  b   c
  #3 10    10  f    

Or using aggregate and str_split_fixed

 res1 <- aggregate(.~ V1, data=dat, FUN=function(x) paste(sort(x), collapse=","))
 library(stringr)
 res1[, paste0("V", 3:4)] <- as.data.frame(str_split_fixed(res1$V3, ",", 2), 
                                              stringsAsFactors=FALSE)

If you need NA for missing values

  res1[res1==''] <- NA
  res1
  # V1    V2 V3   V4
 #1  1 1,2,4  a  d,e
 #2  2   3,5  b    c
 #3 10    10  f <NA>

data

dat <- structure(list(V1 = c(1L, 2L, 2L, 1L, 1L, 10L), V2 = c(1L, 3L, 
5L, 4L, 2L, 10L), V3 = c("a", "b", "c", "d", "e", "f")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Very functional. For future users in this situation. If V1 and V2 are numerical, R 3.1.1 will return "Error: length(n) == 1 is not TRUE". In order to function, the numbers must be an Integer variable. Appreciate the assist! – Aaron Oct 02 '14 at 11:35
  • @Aaron I am using `R 3.1.0`. I guess you were able to fix the error by converting to `integer`.. – akrun Oct 02 '14 at 11:45
  • I ran the code on another data set with numeric columns, and then -yes- converted them from numeric class to integer. It worked nicely. :-) – Aaron Oct 02 '14 at 13:03
3

Here's an approach using data.table, with data from @akrun's post:

It might be useful to store the columns as list instead of pasting them together.

require(data.table) ## 1.9.2+
setDT(dat)[order(V1, V2), list(V2=list(V2), V3=V3[1L], V4=list(V3[-1L])), by=V1]
#    V1    V2 V3  V4
# 1:  1 1,2,4  a e,d
# 2:  2   3,5  b   c
# 3: 10    10  f    

setDT(dat) converts the data.frame to data.table, by reference (without copying it). Then, we sort it by columns V1,V2 and group by V1 column on the sorted data, and for each group, we create the columns V2, V3 and V4 as shown.

V2 and V4 will be of type list here. If you'd rather have a character column where all entries are pasted together, just replace list(.) with paste(., sep=...).

HTH

Arun
  • 116,683
  • 26
  • 284
  • 387