Order and subset a multi-column dataframe in R?

Question

I wanted to order a multi-column dataframe by one column and subset it, but the command I used did not work:

print(df[order(df$x) & df$x < 5,])

This does not order the results.

To debug this I generated a test dataframe with 1 column, but this 'simplification' had unexpected effects:

df <- data.frame(x = sample(1:50))

print(df[order(df$x) & df$x < 5,])

This does not order the results, so I felt I had reproduced the problem with simpler data.

Breaking the process down into first ordering and then subsetting led me to discover that, in this case, the ordering does not return a dataframe object:

df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))

produces

[1] "integer"

Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.

print(ndf[ndf$x < 5, ])

obviously generates an error:

Error in ndf$x : $ operator is invalid for atomic vectors.

Simplifying even further, I found that subsetting alone (without applying the order function) does not return a dataframe object either:

ndf <- df[df$x < 5,]

class(ndf)
[1] "integer"

It turns out that for the multi-column dataframe, separating the ordering and the subsetting does work as expected:

df <- data.frame(x = sample(1:50), y = rnorm(50))

ndf <- df[order(df$x),]

print(ndf[ndf$x < 5, ])

and this solved my original problem, but led to two further questions:

1. Why is the object returned in the 1-column test case above not a dataframe? (I appreciate that a 1-column dataframe just contains a single vector, but it's still wrapped in a dataframe?)
2. Is it possible to order and subset a multi-column dataframe in 1 step?

data.frames in R automatically simplify to vectors when selecting just one column: http://stackoverflow.com/questions/21025609/how-do-i-extract-a-single-column-from-a-data-frame-as-a-data-frame (you can prevent that with `drop=FALSE`). Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). — MrFlick, Mar 08 '17 at 16:34

score 6 · Accepted Answer · edited May 23 '17 at 12:31

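A data.frame in R automatically simplifies to a vector when you select just one column. This is a common and useful simplification, and it is described in more detail in this question (http://stackoverflow.com/questions/21025609/how-do-i-extract-a-single-column-from-a-data-frame-as-a-data-frame). You can prevent it with drop=FALSE.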
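As a minimal sketch of the drop=FALSE behaviour, the example below uses a one-column data frame like the one in the question. The seed and the variable names v, ndf and small are assumptions chosen for illustration.

```r
# Hypothetical one-column data frame, mirroring the question's test case
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(x = sample(1:50))

# Default behaviour: the single remaining column is dropped to a plain vector
v <- df[order(df$x), ]
class(v)

# drop = FALSE keeps the data.frame wrapper
ndf <- df[order(df$x), , drop = FALSE]
class(ndf)

# Data frame syntax now works on the ordered result
small <- ndf[ndf$x < 5, , drop = FALSE]
print(small)
```

Passing drop = FALSE in both indexing steps should keep every intermediate result a data.frame, so ndf$x remains valid even with a single column.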

Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). This line doesn't make a lot of sense:

df[order(df$x) & df$x < 5,]

Subsetting in R can be done either with a vector of row indices (which order() returns) or with logical values (which the < comparison returns). Mixing them with just an & doesn't tell R to do both: & coerces the nonzero indices from order() to TRUE, so the expression reduces to df$x < 5 and the ordering is lost. But you can break the work into two steps with subset():

subset(df[order(df$x),], x < 5)

This does the ordering first and then the subsetting. Note that the condition no longer directly references the values in df specifically; it filters the rows of the re-ordered data.frame.
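To make the difference between the two kinds of index concrete, here is a small sketch in base R. It first checks what the mixed index from the question evaluates to, then builds a single integer index that both orders and subsets. The seed and the names mixed, idx and res are assumptions for illustration, not part of the original code.

```r
# Hypothetical two-column data frame, as in the question's multi-column case
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(x = sample(1:50), y = rnorm(50))

# order() returns positive row indices; `&` treats every nonzero number as TRUE,
# so the mixed index is just the logical filter and no reordering happens
mixed <- order(df$x) & df$x < 5
identical(mixed, df$x < 5)

# One integer index that does both jobs: take the ordering,
# then keep only the positions whose (sorted) x value is below 5
idx <- order(df$x)
res <- df[idx[df$x[idx] < 5], , drop = FALSE]
print(res)
```

This is only one possible way to combine the two steps; the subset() call above is arguably easier to read.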

Operations like this are one of the reasons many people prefer the dplyr package for data manipulation. For example, this can be done with

library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)
