Extracting specific columns from a data frame

Question

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

 data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

Joshua Ulrich · Answer 1 · 2020-06-30T14:20:29.953

516

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame, whereas df["A"] always returns a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
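
As a minimal sketch of the single-column behaviour described above, the following compares the three forms and wraps the list-style form in a small function. The helper name select_cols is hypothetical and only for illustration; it is not part of base R.

```r
# Same one-row example data as above
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

a_list_style   <- df["A"]                 # list-style: stays a data frame
a_matrix_style <- df[, "A"]               # matrix-style: drops to a vector
a_no_drop      <- df[, "A", drop = FALSE] # matrix-style, dropping disabled

# Hypothetical helper: column names arrive as a character vector,
# so the caller can build them programmatically
select_cols <- function(data, cols) {
  data[cols]
}

wanted <- c("A", "B", "E")
res <- select_cols(df, wanted)
```

Because the names are ordinary strings, the same call works whether one column or several are requested, which is the property that matters inside functions and packages.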

edited Jun 30 '20 at 14:20

answered Apr 10 '12 at 02:44

Joshua Ulrich

4

That gives the error `object of type 'closure' is not subsettable`. – Aren Cambre Apr 10 '12 at 02:48
24

@ArenCambre: then your data.frame isn't really named `df`. `df` is also a function in the stats package. – Joshua Ulrich Apr 10 '12 at 02:58
5

@ArenCambre: http://2.bp.blogspot.com/-XU9PduVhq-I/Um-Y6e19jZI/AAAAAAAADfI/PrmoFQexa5M/s1600/Book+last+page.jpg – tumultous_rooster Jan 20 '15 at 01:53
But why doesn't this one work: `df[,-c("A","B","E")]`? – Cina Jun 27 '15 at 08:38
2

@Cina: Because `-"A"` is a syntax error. And `?Extract` says, "`i`, `j`, `...` can also be negative integers, indicating elements/slices to leave out of the selection." – Joshua Ulrich Jun 27 '15 at 14:43
I get `SyntaxError: invalid syntax` when I use this. – Richard Dec 17 '15 at 19:08
@Richard: I don't know what to tell you; it's most certainly valid syntax. I've also added a reproducible example. – Joshua Ulrich Dec 17 '15 at 19:23
8

There is an issue with this syntax: if we extract only one column, R returns a vector instead of a data frame, and this could be unwanted: `> df[,c("A")]` `[1] 1`. Using `subset` doesn't have this disadvantage. – David Dorchies Jul 27 '16 at 13:49
I noticed that using a character vector of names of columns often doesn't work when I have the package `data.table` loaded. In those cases, it's better to use one of the other methods, either `subset()` or dplyr's `select()`. – Paul de Barros Nov 02 '16 at 17:08
1

@PauldeBarros: whether `data.table` is loaded should not matter, unless the *object* you're trying to subset is a `data.table` and not a `data.frame`. This question is about subsetting a `data.frame`, not a `data.table`. – Joshua Ulrich Nov 02 '16 at 17:59
1

@JoshuaUlrich: I'm sure you're right. The objects that create problems are of class data.frame and data.table (according to `str`). Without data.table loaded, those same objects are just data.frames. I didn't realize that loading data.table before running my code would change the class of those objects. The fact that the method you provided (and which I had been using) suddenly stopped working was surprising to me, and I thought others might benefit from knowing that the behavior of their code might change in that circumstance. – Paul de Barros Nov 02 '16 at 20:25
1

Maybe better `df[c("A","B","E")]` (without comma) ? we win one character and we address the issue that @david-dorchies pointed out – moodymudskipper May 22 '19 at 09:51
@Moody_Mudskipper Better is to set argument `drop = FALSE`: `df[, "A", drop = FALSE]`. – Rui Barradas Jun 30 '20 at 07:19
1

With list subsetting, the drop argument is unnecessary, that's what I meant. `df["A"]` is fine. – moodymudskipper Jun 30 '20 at 07:54

score 246 · Accepted Answer · answered Apr 19 '15 at 21:19

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
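
If the column names are held in a character vector rather than typed as bare names, a sketch like the following should work. It assumes dplyr 1.0.0 or later, where all_of() is available; the data frame df1 and the vector cols are illustrative.

```r
library(dplyr)

# Illustrative data; any data frame with columns A, B and E would do
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)

# Column names held in a variable rather than typed as bare symbols
cols <- c("A", "B", "E")

res <- df1 %>%
  select(all_of(cols))
```

all_of() raises an error if any requested name is missing; any_of() is the lenient counterpart that ignores names not present.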

answered Apr 19 '15 at 21:19

Sam Firke

5

Given the considerable evolution of the Tidyverse since posting my question, I've switched the answer to you. – Aren Cambre Aug 16 '18 at 13:58
6

Given the furious rate of change in the tidyverse, I would caution against using this pattern. This is in addition to my strong preference against treating column names as if they are object names when writing code for functions, packages, or applications. – Joshua Ulrich May 22 '19 at 11:21
3

It has been over four years since this answer was submitted, and the pattern hasn't changed. Piped expressions can be quite intuitive, which is why they are appealing. – Aren Cambre Jun 25 '19 at 01:57
How do I run a further command on this subset? E.g. I want to compute the row means: "df1 %>% rowMeans(select(A, B, E))" does not work. – Ben May 11 '20 at 06:14
1

You'd chain together a pipeline like: `df1 %>% select(A, B, E) %>% rowMeans(.)`. See the documentation for the `%>%` pipe by typing `?magrittr::\`%>%\`` – Sam Firke May 11 '20 at 15:52
3

This is a useful solution, but for the example given in the question, Josh's answer is more readable, faster, and dependency free. I hope new users learn square bracket subsetting before diving into the tidyverse :)! – moodymudskipper Aug 17 '21 at 09:54
`select(df, c('A','B','C'))` – Jul 26 '22 at 09:01

score 112 · Answer 3 · edited Jan 15 '14 at 00:24

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4
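
A few variations on the select argument, sketched with the same example data. Inside subset(), select is evaluated with column names standing for their positions, so bare names, ranges, and negation are all accepted. The R documentation for subset() describes it as a convenience function intended for interactive use, so this is best kept out of package code.

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6),
                  D = c(7, 7), E = c(8, 8), F = c(9, 9))

keep_quoted   <- subset(dat, select = c("A", "B", "E"))  # quoted names
keep_unquoted <- subset(dat, select = c(A, B, E))        # bare names
keep_range    <- subset(dat, select = A:C)               # contiguous range
drop_some     <- subset(dat, select = -c(C, D, F))       # negate to exclude
```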

edited Jan 15 '14 at 00:24

Uli Köhler

answered Apr 10 '12 at 09:50

Stéphane Laurent

When I try this, with my data, I get the error: " Error in x[j] : invalid subscript type 'list' " But if c("A", "B") isn't a list, what is it? – Rafael_Espericueta Nov 28 '16 at 18:04
@Rafael_Espericueta Hard to guess without viewing your code... But `c("A", "B")` is a vector, not a list. – Stéphane Laurent Nov 28 '16 at 18:19
It converts the data frame to a list. – Suat Atan PhD Jun 21 '17 at 09:42

score 88 · Answer 4 · answered Apr 10 '12 at 06:49

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8

score 21 · Answer 5 · edited Mar 07 '19 at 04:24

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))

edited Mar 07 '19 at 04:24

Arthur Yip

answered Jun 10 '16 at 11:34

Richard Ball

8

This doesn't use `dplyr`. It uses `base::subset`, and is identical to [Stephane Laurent's answer](https://stackoverflow.com/a/10086494/903061) except that you use column numbers instead of column names. – Gregor Thomas Oct 12 '17 at 18:16

score 21 · Answer 6 · answered Oct 12 '17 at 18:12

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
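
One possible explanation, offered as an assumption since the original data is not shown: "undefined columns selected" is what [ reports when at least one requested name is absent from names(df), for example because of a typo, a difference in case, or stray whitespace. The %in% form never raises that error because it keeps only the names that actually match. A sketch for locating the mismatch, using made-up data with a deliberately mismatched name:

```r
# Made-up data: note the lower-case "e" instead of "E"
df <- data.frame(A = 1:2, B = 3:4, e = 5:6)
wanted <- c("A", "B", "E")

# Names requested but not present in the data frame
missing_cols <- setdiff(wanted, names(df))

# Indexing by name fails when any name is missing; capture the message
by_name <- tryCatch(df[wanted], error = function(err) conditionMessage(err))

# The %in% form silently returns only the columns that matched
by_match <- df[, names(df) %in% wanted]
```

Checking missing_cols first makes the silent partial match visible, which the %in% form would otherwise hide.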

answered Oct 12 '17 at 18:12

so860

score 15 · Answer 7 · edited Apr 20 '18 at 16:57

You can also use the sqldf package, which runs SQL select statements on R data frames:

library(sqldf)

df1 <- sqldf("select A, B, E from df")

This returns a data frame df1 with columns A, B, and E.

edited Apr 20 '18 at 16:57

Gilad Green

answered Nov 30 '16 at 08:00

Aman Burman

score 5 · Answer 8 · answered May 22 '19 at 09:49

You can use with():

with(df, data.frame(A, B, E))

answered May 22 '19 at 09:49

moodymudskipper

score 1 · Answer 9 · answered Oct 15 '19 at 19:54

df <- dplyr::select(df, A, B, C)

You can also assign the result to a different name, leaving the original data frame unchanged:

data <- dplyr::select(df, A, B, C)

answered Oct 15 '19 at 19:54

Mohamed Rahouma

This was already in the accepted answer – camille Feb 13 '22 at 18:01

score 0 · Answer 10 · answered Nov 09 '16 at 15:32

[ and subset() are not interchangeable:

With a comma, [ returns a vector when only one column is selected, whereas subset() returns a data frame.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)

answered Nov 09 '16 at 15:32

fxi

5

Not if you set `drop=FALSE`. Example: `df[,c("a"),drop=F]` – untill Sep 19 '17 at 10:48
