445

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

 data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

M--
Aren Cambre

10 Answers

516

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame, whereas df["A"] always returns a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1
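
Because the comma-free form always returns a data frame, it is straightforward to wrap in a function that takes the column names as an ordinary character vector. The following is a minimal sketch, assuming a hypothetical helper called select_cols (it is not part of base R) and a data frame named dat rather than df:

```r
# Hypothetical helper (not part of base R): keep only the named columns.
select_cols <- function(data, cols) {
  stopifnot(is.data.frame(data), is.character(cols))
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("columns not found: ", paste(missing_cols, collapse = ", "))
  }
  data[cols]  # list-style subsetting: the result is always a data frame
}

dat <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
wanted <- c("A", "B", "E")          # column names held in a variable
res_many <- select_cols(dat, wanted)
res_one  <- select_cols(dat, "A")   # still a data frame with one column
```

The explicit check on missing names is a design choice for this sketch; plain dat[wanted] would also fail on an unknown name, just with a less specific message.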

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
Joshua Ulrich
  • 4
    That gives the error `object of type 'closure' is not subsettable`. – Aren Cambre Apr 10 '12 at 02:48
  • 24
    @ArenCambre: then your data.frame isn't really named `df`. `df` is also a function in the stats package. – Joshua Ulrich Apr 10 '12 at 02:58
  • 5
    @ArenCambre: http://2.bp.blogspot.com/-XU9PduVhq-I/Um-Y6e19jZI/AAAAAAAADfI/PrmoFQexa5M/s1600/Book+last+page.jpg – tumultous_rooster Jan 20 '15 at 01:53
  • but why this one doesn't work `df[,-c("A","B","E")]`? – Cina Jun 27 '15 at 08:38
  • 2
    @Cina: Because `-"A"` is a syntax error. And `?Extract` says, "`i`, `j`, `...` can also be negative integers, indicating elements/slices to leave out of the selection." – Joshua Ulrich Jun 27 '15 at 14:43
  • I get `SyntaxError: invalid syntax` when I use this. – Richard Dec 17 '15 at 19:08
  • @Richard: I don't know what to tell you; it's most certainly valid syntax. I've also added a reproducible example. – Joshua Ulrich Dec 17 '15 at 19:23
  • 8
    There is an issue with this syntax because if we extract only one column, R returns a vector instead of a dataframe and this could be unwanted: `> df[,c("A")]` `[1] 1`. Using `subset` doesn't have this disadvantage. – David Dorchies Jul 27 '16 at 13:49
  • I noticed that using a character vector of names of columns often doesn't work when I have the package `data.table` loaded. In those cases, it's better to use one of the other methods, either `subset()` or dplyr's `select()`. – Paul de Barros Nov 02 '16 at 17:08
  • 1
    @PauldeBarros: whether `data.table` is loaded should not matter, unless the *object* you're trying to subset is a `data.table` and not a `data.frame`. This question is about subsetting a `data.frame`, not a `data.table`. – Joshua Ulrich Nov 02 '16 at 17:59
  • 1
    @JoshuaUlrich: I'm sure you're right. The objects that create problems are of class data.frame and data.table (according to `str`). Without data.table loaded, those same objects are just data.frames. I didn't realize that loading data.table before running my code would change the class of those objects. The fact that the method you provided (and which I had been using) suddenly stopped working was surprising to me, and I thought others might benefit from knowing that the behavior of their code might change in that circumstance. – Paul de Barros Nov 02 '16 at 20:25
  • 1
    Maybe better `df[c("A","B","E")]` (without comma) ? we win one character and we address the issue that @david-dorchies pointed out – moodymudskipper May 22 '19 at 09:51
  • @Moody_Mudskipper Better is to set argument `drop = FALSE`: `df[, "A", drop = FALSE]`. – Rui Barradas Jun 30 '20 at 07:19
  • 1
    With list subsetting, the drop argument is unnecessary, that's what I meant. `df["A"]` is fine. – moodymudskipper Jun 30 '20 at 07:54
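
The question raised in the comments above (why -c("A","B","E") fails) comes up often enough that a short sketch may help. Negative indices only work with integer positions, so dropping columns by name needs one extra step. The data frame dat and the vector drop_names below are illustrative names, not from the original answer:

```r
dat <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
drop_names <- c("A", "B", "E")

# Keep every column whose name is not in drop_names.
kept_setdiff <- dat[setdiff(names(dat), drop_names)]

# Same result with a logical index.
kept_logical <- dat[!(names(dat) %in% drop_names)]

# Negative indices also work once the names are converted to positions.
# This form assumes every name in drop_names exists in dat.
kept_negative <- dat[-match(drop_names, names(dat))]
```

The first two forms tolerate names that are not present in the data frame; the match() form does not, which is why it carries the assumption in the comment.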
246

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
Sam Firke
  • 5
    Given the considerable evolution of the Tidyverse since posting my question, I've switched the answer to you. – Aren Cambre Aug 16 '18 at 13:58
  • 6
    Given the furious rate of change in the tidyverse, I would caution against using this pattern. This is in addition to my strong preference against treating column names as if they are object names when writing code for functions, packages, or applications. – Joshua Ulrich May 22 '19 at 11:21
  • 3
    It has been over four years since this answer was submitted, and the pattern hasn't changed. Piped expressions can be quite intuitive, which is why they are appealing. – Aren Cambre Jun 25 '19 at 01:57
  • how do I execute a further command onto this subset? E.g. I want to compute the rowMean: "df1 %>% rowMeans(select(A, B, E))" does not work. – Ben May 11 '20 at 06:14
  • 1
    You'd chain together a pipeline like: `df1 %>% select(A, B, E) %>% rowMeans(.)`. See the documentation for the `%>%` pipe by typing `?magrittr::\`%>%\`` – Sam Firke May 11 '20 at 15:52
  • 3
    This is a useful solution, but for the example given in the question, Josh's answer is more readable, faster, and dependency free. I hope new users learn square bracket subsetting before diving in the tidyverse :)! – moodymudskipper Aug 17 '21 at 09:54
  • `select(df, c('A','B','C'))` –  Jul 26 '22 at 09:01
112

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4
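
As a hedged illustration of what else the select argument accepts: subset() evaluates select with the column names standing in for their positions, so unquoted names, ranges, and negation should all work. The result names below (abe, a_to_c, no_cdf) are made up for this sketch:

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6),
                  D = c(7, 7), E = c(8, 8), F = c(9, 9))

abe    <- subset(dat, select = c(A, B, E))    # unquoted column names
a_to_c <- subset(dat, select = A:C)           # a range of adjacent columns
no_cdf <- subset(dat, select = -c(C, D, F))   # drop columns instead of keeping them
```

Note that the help page for subset() describes it as a convenience function intended for interactive use, and recommends standard subsetting such as [ for programming.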
Uli Köhler
Stéphane Laurent
88

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8
Henry
21

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))
Arthur Yip
Richard Ball
  • 8
    This doesn't use `dplyr`. It uses `base::subset`, and is identical to [Stephane Laurent's answer](https://stackoverflow.com/a/10086494/903061) except that you use column numbers instead of column names. – Gregor Thomas Oct 12 '17 at 18:16
21

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
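
One possible explanation, offered as an assumption rather than a diagnosis of this particular case: "undefined columns selected" is what name-based indexing reports when at least one requested name does not exactly match a column name (different case, stray whitespace, and so on), whereas the %in% form silently keeps whatever does match. A sketch with a deliberately mistyped name:

```r
dat <- data.frame(A = 1:2, B = 3:4, E = 5:6)
wanted <- c("A", "B", "e")   # assumed typo for illustration: "e" instead of "E"

# Indexing by name fails as soon as one requested name is absent.
res <- tryCatch(dat[, wanted], error = function(e) e)

# The %in% form silently keeps only the names that do match.
partial <- dat[, names(dat) %in% wanted]

# Listing the mismatches usually reveals the cause.
not_found <- setdiff(wanted, names(dat))
```

If not_found is non-empty for your own data, comparing it against names(df) should show which names differ.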

so860
15

You can also use the sqldf package, which runs SQL SELECT statements against R data frames:

library(sqldf)
df1 <- sqldf("select A, B, E from df")

This returns a data frame df1 with columns A, B, and E.

Gilad Green
Aman Burman
5

You can use with():

with(df, data.frame(A, B, E))
moodymudskipper
1
df <- dplyr::select(df, A, B, E)

You can also assign the result to a new name instead of overwriting df:

data <- dplyr::select(df, A, B, E)
Mohamed Rahouma
0

[ and subset() are not interchangeable:

With the comma form, [ returns a vector if only one column is selected, whereas subset() returns a data frame.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)
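
A small sketch of the ways to keep a data frame when selecting a single column, using a data frame named df_small (an illustrative name) built the same way as above:

```r
df_small <- data.frame(a = "a", b = "b")

one_comma  <- df_small[, "a"]                  # vector: the dimension is dropped
one_keep   <- df_small[, "a", drop = FALSE]    # data frame
one_list   <- df_small["a"]                    # data frame (no comma)
one_subset <- subset(df_small, select = "a")   # data frame
```

Either drop = FALSE or the comma-free form should make [ behave like subset() in the single-column case.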
fxi