I am fairly new to R, and I am working with a large data set. I made an example of my problem below (the data set is tab-delimited). Basically, I want to collapse the data by ID number so that all attributes for an ID are contained in one cell instead of being spread over many rows.

The actual data set I am working with is genomic in nature, with the "ID" being the "gene name" and the "attribute" being the "pathway" that the gene is associated with. My data set is ~5,000,000 rows long.

I have tried messing around with cbind and rbind, but they do not seem to be specific enough for what I need.

My data set currently looks something like this:

ID  Attributes
1   apple
1   banana
1   orange
1   pineapple
2   apple
2   banana
2   orange
3   apple
3   banana
3   pineapple

And I want it to look like this:

ID  Attributes
1   apple,banana,orange,pineapple
2   apple,banana,orange
3   apple,banana,pineapple

If you have another way besides using R, that would work as well. Thank you for your help.

Kevin Arseneau

2 Answers

A base R solution: split df by ID, paste the Attributes within each group together, then rbind the list of results.

do.call(rbind, by(df, df$ID, 
    function(x) data.frame(ID=x$ID[1], Attributes=paste(x$Attributes, collapse=","))
))

data:

df <- read.table(text="ID  Attributes
1   apple
1   banana
1   orange
1   pineapple
2   apple
2   banana
2   orange
3   apple
3   banana
3   pineapple", header=TRUE)
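
The same split-and-paste idea can probably also be written with aggregate(), which does the grouping and the recombining in one call. This is only a sketch of an alternative, not the approach used above; the variable name collapsed is made up for illustration, and I have not timed it on 5,000,000 rows.

```r
# Same example data as in the question.
df <- read.table(text = "ID  Attributes
1   apple
1   banana
1   orange
1   pineapple
2   apple
2   banana
2   orange
3   apple
3   banana
3   pineapple", header = TRUE, stringsAsFactors = FALSE)

# For each ID, paste its Attributes into one comma-separated string.
# The extra argument collapse = "," is passed through to paste().
collapsed <- aggregate(Attributes ~ ID, data = df, FUN = paste, collapse = ",")
print(collapsed)
```

The result is an ordinary data.frame with one row per ID, so it can be written out with write.table() like any other.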
chinsoon12

One approach is to group_by your ID and summarise with paste, using dplyr.

library(dplyr)

df <- read.table(text = "
  ID  Attributes
  1   apple
  1   banana
  1   orange
  1   pineapple
  2   apple
  2   banana
  2   orange
  3   apple
  3   banana
  3   pineapple", header = TRUE, stringsAsFactors = FALSE)

df %>%
  group_by(ID) %>%
  summarise(
    Attributes = paste(Attributes, collapse = ", ")
  )

# # A tibble: 3 x 2
#      ID Attributes                      
#   <int> <chr>                           
# 1     1 apple, banana, orange, pineapple
# 2     2 apple, banana, orange           
# 3     3 apple, banana, pineapple
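
Since the real data set is a tab-delimited file, here is a rough end-to-end sketch in base R: read the file, collapse by ID, write the result back out. The file names are hypothetical (temporary files stand in for the real input and output), the column names ID and Attributes are assumed to match the example, and tapply() is swapped in for the grouping step purely for illustration.

```r
# Hypothetical input: a temporary tab-delimited file standing in for the real data.
infile <- tempfile(fileext = ".tsv")
writeLines(c("ID\tAttributes",
             "1\tapple", "1\tbanana",
             "2\tapple", "2\torange"), infile)

# read.delim() expects tab separators and a header row by default.
genes <- read.delim(infile, stringsAsFactors = FALSE)

# Paste the attributes of each ID into one comma-separated string.
# tapply() returns one string per ID, named by the ID value.
pasted <- tapply(genes$Attributes, genes$ID, paste, collapse = ",")
collapsed <- data.frame(ID = names(pasted),
                        Attributes = as.vector(pasted),
                        stringsAsFactors = FALSE)

# Write the collapsed table back out as tab-delimited text.
outfile <- tempfile(fileext = ".tsv")
write.table(collapsed, outfile, sep = "\t", quote = FALSE, row.names = FALSE)
```

Note that ID comes back as character here because it is taken from the names of the tapply() result; convert it with as.integer() if the IDs are numeric.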
Kevin Arseneau