
I have a data reorganization task that I think could be handled by R's plyr package. I have a dataframe with numeric data organized in groups. Within each group I need to have the data sorted largest to smallest.

The data looks like this (code to generate it is below):

group     value
2     b 0.1408790
6     b 1.1450040   # 2nd b is larger than 1st, so not descending
1     c 5.7433568
3     c 2.2109819
4     d 0.5384659
5     d 4.5382979

What I would like is this.

group     value
b 1.1450040  #1st b is largest
b 0.1408790
c 5.7433568
c 2.2109819
d 4.5382979
d 0.5384659

So, what I need plyr to do is go through each group, apply something like order to the numeric data, reorganize the rows accordingly, save the reordered subset, and put it all back together at the end.

I can process this "by hand" with a list and some loops, but it takes a long, long time. Can this be done by plyr in a couple of lines?
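For what it's worth, here is a two-line plyr sketch of that idea (splitting on group, sorting each piece descending by value, and recombining), assuming df as generated below:

```r
library(plyr)

# Split df by group, sort each piece descending by value, recombine.
sorted <- ddply(df, "group", function(x) x[order(x$value, decreasing = TRUE), ])
```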

Example data

df.sz <- 6; groups <- c("a","b","c","d")
df <- data.frame(group = sample(groups, df.sz, replace = TRUE),
                 value = runif(df.sz, 0, 10), stringsAsFactors = FALSE)
df <- df[order(df$group),] #order by group letter

The inefficient approach using loops:

My current approach is to separate the dataframe df into a list by groups, apply order to each element of the list, and overwrite the original list element with the reordered element. I then use a loop to re-assemble the dataframe. (As a learning exercise, I'd also be interested in how to make this code more efficient. In particular, what would be the most efficient way, using base R functions, to turn a list into a dataframe?)

Vector of the unique groups in the dataframe

groups.u <- unique(df$group)

Create empty list

my.list <- as.list(groups.u); names(my.list) <- groups.u

Break up df by $group into list

for(i in 1:length(groups.u)){
  i.working <- which(df$group == groups.u[i]) 
  my.list[[i]] <- df[i.working, ]
}

Sort elements within list using order

for(i in 1:length(my.list)){
  order.x <- order(my.list[[i]]$value,na.last = TRUE, decreasing = TRUE)
  my.list[[i]] <- my.list[[i]][order.x, ] 
}

Finally, rebuild df from the list. First, make a seed row for the loop

new.df <- my.list[[1]][1, ]; new.df[1, ] <- NA
for(i in 1:length(my.list)){
  new.df <- rbind(new.df,my.list[[i]])
}

Remove the seed row

new.df <- new.df[-1,]
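As for the parenthetical question above: a common base R idiom for turning a list of data frames into one data frame is do.call with rbind, which combines everything in a single call and makes the seed-row trick unnecessary:

```r
# Combine all list elements into one data frame in a single call.
new.df <- do.call(rbind, my.list)
rownames(new.df) <- NULL  # drop the composite row names rbind creates
```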
N Brouwer

2 Answers


You could use dplyr, the successor to plyr that focuses on data frames:

library(dplyr)
arrange(df, group, desc(value))
davechilders

It's virtually sacrilegious to include a "data.table" response in a question tagged "plyr" or "dplyr", but your comment indicates you're looking for fast compact code.

In "data.table", you could use setorder, like this:

 setorder(setDT(df), group, -value)

That command does two things:

  1. It converts your data.frame to a data.table without copying.
  2. It sorts your columns by reference (again, no copying).

You mention "> 50k rows". That's actually not very large, and even base R should be able to handle it well. In terms of "dplyr" and "data.table", you're looking at measurements in the milliseconds. That could make a difference as your input datasets become larger.
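As a point of comparison, the base R equivalent needs no packages at all: a single order call, negating value (which works here since it is numeric) to get a descending sort within each group:

```r
# Sort by group ascending, then value descending, in base R.
df.sorted <- df[order(df$group, -df$value), ]
```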

set.seed(1)
df.sz <- 50000
groups <- c(letters, LETTERS)
df <- data.frame(
  group = sample(groups, df.sz, replace = TRUE),
  value = runif(df.sz,0,10), stringsAsFactors = FALSE)
library(data.table)
library(dplyr)
library(microbenchmark)
dt1 <- function() as.data.table(df)[order(group, -value)]
dt2 <- function() setorder(as.data.table(df), group, -value)[]
dp1 <- function() arrange(df, group, desc(value))
microbenchmark(dt1(), dt2(), dp1())
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval
#  dt1()  5.749002  5.981274  7.725225  6.270664  8.831899 67.402052   100
#  dt2()  4.956020  5.096143  5.750724  5.229124  5.663545  8.620155   100
#  dp1() 37.305364 37.779725 39.837303 38.169298 40.589519 96.809736   100
A5C1D2H2I1M1N2O1R2T1