
I have a dataset. My data A looks like:

author_id  paper_id  prob
      731     24943    1
      731     24943    1
      731    688974    1
      731    964345   .8
      731   1201905   .9
      731   1267992    1
      736       249   .2
      736      6889    1
      736     94345   .7
      736   1201905   .9
      736    126992   .8

The output I desire is:

author_id    paper_id
  731        24943,24943,688974,1267992,1201905,964345
  736        6889,1201905,126992,94345,249

That is, the paper_ids for each author are arranged in decreasing order of probability.

If I were to use a combination of SQL and R, I think the solution would be something like:

statement<-"select * from A 
            GROUP BY author_id
            ORDER BY prob"

Then, in R, paste the paper_ids together per author_id once the order is set.
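
For example, with the sqldf package that two-step version might look like this (just a sketch; A is my data frame above, and `ordered` is an illustrative name):

library(sqldf)

# step 1: let SQL sort the rows by decreasing probability
ordered <- sqldf("SELECT * FROM A ORDER BY prob DESC")

# step 2: collapse paper_id per author in R, keeping the sorted order
aggregate(paper_id ~ author_id, ordered, paste, collapse = ",")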

But I need the entire solution in R. How could this be done?

Thanks

– user3171906
    Why are there duplicates with the same author, paper, and probability? Also, the desired format doesn't seem very useful. Are you planning to use it for further analysis? – rawr Mar 27 '14 at 11:19
  • @rawr: duplicates are to be present in the output. The output is the final result. – user3171906 Mar 27 '14 at 11:27
  • 1
    plyr is slightly overkill - you can do this easily in base R : `with(dat[order(dat$prob , decreasing=TRUE),], aggregate(paper_id , list(author_id=author_id) , paste , sep=","))` – user20650 Mar 27 '14 at 11:41
  • @user20650: your code worked like a gem, thanks – user3171906 Mar 28 '14 at 08:07

3 Answers


If temp is your data set, then do:

library(data.table)
setDT(temp)[order(-prob), list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249

Edit: 8/11/2014

Since data.table v1.9.4, you can use the very efficient setorder (which sorts by reference, without making a copy) instead of order:

setorder(setDT(temp), -prob)[, list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]
##    author_id                                       paper_id
## 1:       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2:       736              6889, 1201905, 126992, 94345, 249
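
Note that because setorder works by reference, temp itself stays sorted by decreasing prob afterwards. If the original row order matters, a small sketch that sorts a copy (temp2 is just an illustrative name) instead:

temp2 <- copy(temp)   # data.table::copy() returns a true (deep) copy
setorder(temp2, -prob)
temp2[, list(paper_id = paste0(paper_id, collapse=", ")), by = author_id]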

And as a side note, this whole thing can easily be done in base R too (though it isn't recommended for big data sets):

aggregate(paper_id ~ author_id, temp[order(-temp$prob), ], paste, collapse = ", ")
#   author_id                                       paper_id
# 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
# 2       736              6889, 1201905, 126992, 94345, 249
– David Arenburg
    +1, Or slightly simpler `data.table(df)[order(-prob), paste0(paper_id, collapse=", "), by=author_id]` – BrodieG Mar 27 '14 at 12:53
  • `df = data.table(df)[order(-prob), paste0(paper_id, collapse = ","), by = author_id]; setnames(df, "V1", "paper_id")` – Henk Mar 27 '14 at 14:41

To complete the set, here's a dplyr answer:

df  <- read.table(header = T, text =
"author_id paper_id prob
731 24943 1
731 24943 1
731 688974 1
731 964345 .8
731 1201905 .9
731 1267992 1
736 249 .2
736 6889 1
736 94345 .7
736 1201905 .9
736 126992 .8") # your dataset

library(dplyr)
df %>%
  group_by(author_id) %>%
  arrange(desc(prob)) %>%
  summarise(paper_id = paste(paper_id, collapse = ", "))

## Source: local data frame [2 x 2]
## 
##   author_id                                       paper_id
## 1       731 24943, 24943, 688974, 1267992, 1201905, 964345
## 2       736              6889, 1201905, 126992, 94345, 249
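
A caveat for newer dplyr versions: there, arrange() ignores grouping by default, though the global sort by desc(prob) above is still sufficient, since summarise() collapses rows in their current order within each group. To make the within-group sort explicit, arrange() accepts a .by_group argument:

# explicit within-group sort via arrange's .by_group argument (newer dplyr)
df %>%
  group_by(author_id) %>%
  arrange(desc(prob), .by_group = TRUE) %>%
  summarise(paper_id = paste(paper_id, collapse = ", "))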
– hadley

You can try this:

library(plyr)

# sample.df is the posted data (the same data frame as df above)
subdf <- ddply(sample.df, .(author_id), function(df) {
  ord <- order(df$prob, decreasing = TRUE)
  data.frame(paper_id = paste(df$paper_id[ord], collapse = ","))
})

subdf 

  author_id                                  paper_id
1       731 24943,24943,688974,1267992,1201905,964345
2       736             6889,1201905,126992,94345,249
– Chitrasen