
I have a dataset. My data A looks like:

author_id  paper_id  prob
      731     24943    1
      731     24943    1
      731    688974    1
      731    964345   .8
      731   1201905   .9
      731   1267992    1
      736       249   .2
      736      6889    1
      736     94345   .7
      736   1201905   .9
      736    126992   .8

The output I desire is:

author_id    paper_id
  731        24943,24943,688974,1267992,1201905,964345
  736        6889,1201905,126992,94345,249

That is, the paper_ids for each author are arranged in decreasing order of probability.

If I were to use a combination of SQL and R, I think the solution would be something like:

statement<-"select * from A 
            GROUP BY author_id
            ORDER BY prob"

Then, in R, paste the paper_ids together per author_id once the order is set.
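
For example, with the sqldf package that two-step version might look like this (just a sketch; A is my data frame above, and `ordered` is an illustrative name):

library(sqldf)

# step 1: let SQL sort the rows by decreasing probability
ordered <- sqldf("SELECT * FROM A ORDER BY prob DESC")

# step 2: collapse paper_id per author in R, keeping the sorted order
aggregate(paper_id ~ author_id, ordered, paste, collapse = ",")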

But I need the entire solution in R. How could this be done?

Thanks

– user3171906
    Why are there duplicates with the same author, paper, and probability? Also, the desired format doesn't seem very useful. Are you planning to use it for further analysis? – rawr Mar 27 '14 at 11:19
  • @rawr: duplicates are to be present in the output. The output is the final result. – user3171906 Mar 27 '14 at 11:27
  • 1
    plyr is slightly overkill - you can do this easily in base R : `with(dat[order(dat$prob , decreasing=TRUE),], aggregate(paper_id , list(author_id=author_id) , paste , sep=","))` – user20650 Mar 27 '14 at 11:41
  • @user20650: your code worked like a gem, thanks – user3171906 Mar 28 '14 at 08:07

3 Answers


If temp is your data set, then do:

library(data.table)
setDT(temp)[order(-prob), list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249

Edit: 8/11/2014

Since data.table v1.9.4, you can use the very efficient setorder (which sorts by reference, without making a copy) instead of order:

setorder(setDT(temp), -prob)[, list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249
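
Note that because setorder works by reference, temp itself stays sorted by decreasing prob afterwards. If the original row order matters, a small sketch that sorts a copy (temp2 is just an illustrative name) instead:

temp2 <- copy(temp)   # data.table::copy() returns a true (deep) copy
setorder(temp2, -prob)
temp2[, list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]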

And as a side note, this whole thing can easily be done in base R too (though it isn't recommended for big data sets):

aggregate(paper_id ~ author_id, temp[order(-temp$prob), ], paste, collapse = ", ")
#   author_id                                       paper_id
# 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
# 2       736              6889, 1201905, 126992, 94345, 249
– David Arenburg
    +1, Or slightly simpler `data.table(df)[order(-prob), paste0(paper_id, collapse=", "), by=author_id]` – BrodieG Mar 27 '14 at 12:53
  • `df = data.table(df)[order(-prob), paste0(paper_id, collapse = ","), by = author_id]; setnames(df, "V1", "paper_id")` – Henk Mar 27 '14 at 14:41

To complete the set, here's a dplyr answer:

df  <- read.table(header = T, text =
"author_id paper_id prob
731 24943 1
731 24943 1
731 688974 1
731 964345 .8
731 1201905 .9
731 1267992 1
736 249 .2
736 6889 1
736 94345 .7
736 1201905 .9
736 126992 .8") # your dataset

library(dplyr)
df %>%
  group_by(author_id) %>%
  arrange(desc(prob)) %>%
  summarise(paper_id = paste(paper_id, collapse = ", "))

## Source: local data frame [2 x 2]
## 
##   author_id                                       paper_id
## 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2       736              6889, 1201905, 126992, 94345, 249
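
A caveat for newer dplyr versions: there, arrange() ignores grouping by default, though the global sort by desc(prob) above is still sufficient, since summarise() collapses rows in their current order within each group. To make the within-group sort explicit, arrange() accepts a .by_group argument:

# explicit within-group sort via arrange's .by_group argument (newer dplyr)
df %>%
  group_by(author_id) %>%
  arrange(desc(prob), .by_group = TRUE) %>%
  summarise(paper_id = paste(paper_id, collapse = ", "))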
– hadley

You can try this:

library(plyr)

# sample.df is the posted data (the same data frame as df above)
subdf <- ddply(sample.df, .(author_id), function(df) {
  ord <- order(df$prob, decreasing = TRUE)
  data.frame(paper_id = paste(df$paper_id[ord], collapse = ","))
})

subdf 

  author_id                                  paper_id
1       731 24943,24943,688974,1267992,1201905,964345
2       736             6889,1201905,126992,94345,249
– Chitrasen