How do you subset a data frame based on column names?

Question

I have this data frame:

 dput(df)
structure(list(Server = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "servera", class = "factor"), 
    Date = structure(1:6, .Label = c("7/13/2017 15:01", "7/13/2017 15:02", 
    "7/13/2017 15:03", "7/13/2017 15:04", "7/13/2017 15:05", 
    "7/13/2017 15:06"), class = "factor"), Host_CPU = c(1.812950134, 
    2.288070679, 1.563278198, 1.925239563, 5.350669861, 2.612503052
    ), UsedMemPercent = c(38.19, 38.19, 38.19, 38.19, 38.19, 
    38.22), jvm1 = c(10.91, 11.13, 11.34, 11.56, 11.77, 11.99
    ), jvm2 = c(11.47, 11.7, 11.91, 12.13, 12.35, 12.57), jvm3 = c(75.65, 
    76.88, 56.93, 58.99, 65.29, 67.97), jvm4 = c(39.43, 40.86, 
    42.27, 43.71, 45.09, 45.33), jvm5 = c(27.42, 29.63, 31.02, 
    32.37, 33.72, 37.71)), .Names = c("Server", "Date", "Host_CPU", 
"UsedMemPercent", "jvm1", "jvm2", "jvm3", "jvm4", "jvm5"), class = "data.frame", row.names = c(NA, 
-6L))

I want to subset this data frame based on the column names stored in this variable:

select<-c("jvm3", "jvm4", "jvm5")

So my final df should look like this:

structure(list(Server = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "servera", class = "factor"), 
    Date = structure(1:6, .Label = c("7/13/2017 15:01", "7/13/2017 15:02", 
    "7/13/2017 15:03", "7/13/2017 15:04", "7/13/2017 15:05", 
    "7/13/2017 15:06"), class = "factor"), Host_CPU = c(1.812950134, 
    2.288070679, 1.563278198, 1.925239563, 5.350669861, 2.612503052
    ), UsedMemPercent = c(38.19, 38.19, 38.19, 38.19, 38.19, 
    38.22), jvm3 = c(75.65, 76.88, 56.93, 58.99, 65.29, 67.97
    ), jvm4 = c(39.43, 40.86, 42.27, 43.71, 45.09, 45.33), jvm5 = c(27.42, 
    29.63, 31.02, 32.37, 33.72, 37.71)), .Names = c("Server", 
"Date", "Host_CPU", "UsedMemPercent", "jvm3", "jvm4", "jvm5"), class = "data.frame", row.names = c(NA, 
-6L))

Any ideas?

`df[c("Server", "Date", "Host_CPU", "UsedMemPercent", select)]`. Or you can use `df[, c("Server", "Date", "Host_CPU", "UsedMemPercent", select)]`. Or `subset(select = c("Server", "Date", "Host_CPU", "UsedMemPercent", select))`. See `?subset` for details. Or `?"["`. — Gregor Thomas, Jul 14 '17 at 17:59
Note that taking the extra step to modify the output from dput into something that can be pasted directly into R is very much appreciated. So instead of just the output from `dput(your_data)` it'd be nice if you pasted it into the form `your_data <- {insert the dput output here}` — Dason, Jul 14 '17 at 18:00
@Gregor, I get this error: Error in `[.data.frame`(data, c("Server", "Date", "Host_CPU", "UsedMemPercent", : undefined columns selected — user1471980, Jul 14 '17 at 18:04
I missed the `df` in `subset`, it should be `subset(df, select = c("Server", "Date", "Host_CPU", "UsedMemPercent", select))`. But the others all run on the data you shared in a fresh R session. — Gregor Thomas, Jul 14 '17 at 18:06
Or even simply `df[select]`. Which is the absolute first thing you should've learnt about R: how indices work. — Joris Meys, Jul 14 '17 at 18:43
Related question : https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name?rq=1 — Joris Meys, Jul 14 '17 at 18:57

Joris Meys · Answer 1 · 2017-07-14T19:06:25.143

Please revisit indices. The index operator [ in R accepts three main types of index:

logical vectors: same length as the number of columns, TRUE means select the column
numeric vectors: selects columns based on position
character vectors: select columns based on name
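As a quick illustration of the three index types, here is a minimal sketch using the built-in iris data frame. The variable names are only for illustration; all three calls should pick the same two columns.

```r
# Three ways to pick Sepal.Width (column 2) and Species (column 5) from iris
by_logical  <- iris[c(FALSE, TRUE, FALSE, FALSE, TRUE)]  # one TRUE/FALSE per column
by_position <- iris[c(2, 5)]                             # column positions
by_name     <- iris[c("Sepal.Width", "Species")]         # column names

# Compare the three results
identical(by_logical, by_position)
identical(by_position, by_name)
```

Character indices are usually the most robust of the three, since they keep working if columns are reordered.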

If you use the index mechanism for data frames, you can treat these objects in two ways:

as a list, because they are internally lists
as a matrix, because they mimic matrix behaviour in many cases
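A small check of this dual nature, again on iris, might look like the following sketch: the same object answers both list-style and matrix-style questions.

```r
# List view: a data frame is a list with one element per column
is.list(iris)
length(iris)   # number of columns

# Matrix view: it also has two dimensions (rows, columns)
dim(iris)
```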

Take the iris data frame as an example to compare the multiple ways you can select columns from a data frame. If you treat it as a list, you have the following two options:

Use [[ if you want a single column in the form of a vector:

iris[["Species"]]
# [1] setosa     setosa     setosa ... : is a vector

Use [ if you want one or more columns returned as a data frame:

iris["Species"]
iris[c("Sepal.Width", "Species")]

If you treat it as a matrix, you just do the exact same as you would do with a matrix. If you don't specify any row indices, these commands are actually equivalent to the ones used above:

iris[ , "Species"] # is the same as iris[["Species"]]
iris[ , "Species", drop = FALSE] # is the same as iris["Species"]
iris[ , c("Sepal.Width", "Species")] # is the same as iris[c("Sepal.Width", "Species")]
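The equivalences claimed in the comments above can be checked directly. This is just a verification sketch using identical() and class(), not part of the original answer.

```r
# Matrix-style and list-style indexing should give identical results
stopifnot(
  identical(iris[, "Species"], iris[["Species"]]),
  identical(iris[, "Species", drop = FALSE], iris["Species"]),
  identical(iris[, c("Sepal.Width", "Species")],
            iris[c("Sepal.Width", "Species")])
)

# The difference between [[ and [ shows up in the class of the result
class(iris[["Species"]])  # a factor (plain vector)
class(iris["Species"])    # a data frame with one column
```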

So in your case, you simply need:

select <- c("Server","Date","Host_CPU","UsedMemPercent",
            "jvm3","jvm4","jvm5")
df[select]
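If you would rather keep `select` as defined in the question, one possible sketch is to combine it with the columns that should always stay and check for missing names before indexing. The two-row `df` below is a cut-down stand-in for the question's data, and `id_cols`, `keep` and `missing_cols` are hypothetical names introduced here. A name that is not in `names(df)` is what produces the "undefined columns selected" error.

```r
# Cut-down stand-in for the question's data frame (first two rows only)
df <- data.frame(
  Server = "servera",
  Date = c("7/13/2017 15:01", "7/13/2017 15:02"),
  Host_CPU = c(1.812950134, 2.288070679),
  UsedMemPercent = c(38.19, 38.19),
  jvm1 = c(10.91, 11.13),
  jvm2 = c(11.47, 11.70),
  jvm3 = c(75.65, 76.88),
  jvm4 = c(39.43, 40.86),
  jvm5 = c(27.42, 29.63)
)

select  <- c("jvm3", "jvm4", "jvm5")                          # as in the question
id_cols <- c("Server", "Date", "Host_CPU", "UsedMemPercent")  # hypothetical helper

keep <- c(id_cols, select)

# Requested names that do not exist in df; should be empty before indexing
missing_cols <- setdiff(keep, names(df))
stopifnot(length(missing_cols) == 0)

result <- df[keep]
names(result)
```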

Note on subset: subset works, but should ONLY be used interactively. There's a warning on the help page stating:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
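One way such a surprise can arise, sketched on iris under the assumption that a variable happens to share its name with a column: subset looks the name up among the columns first, whereas [ uses the variable's value.

```r
# A variable that happens to share its name with a column of iris
Species <- "Sepal.Length"

a <- subset(iris, select = Species)  # non-standard evaluation: the column wins
b <- iris[Species]                   # standard evaluation: the variable's value is used

names(a)
names(b)
```

Here `a` holds the Species column while `b` holds Sepal.Length, which is why [ is the safer choice inside functions and scripts.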

score 2 · Accepted Answer · edited Jul 14 '17 at 18:03

Saving your dataframe to a variable df:

df <-
  structure(
    list(
      Server = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "servera", class = "factor"),
      Date = structure(
        1:6,
        .Label = c(
          "7/13/2017 15:01",
          "7/13/2017 15:02",
          "7/13/2017 15:03",
          "7/13/2017 15:04",
          "7/13/2017 15:05",
          "7/13/2017 15:06"
        ),
        class = "factor"
      ),
      Host_CPU = c(
        1.812950134,
        2.288070679,
        1.563278198,
        1.925239563,
        5.350669861,
        2.612503052
      ),
      UsedMemPercent = c(38.19, 38.19, 38.19, 38.19, 38.19,
                         38.22),
      jvm1 = c(10.91, 11.13, 11.34, 11.56, 11.77, 11.99),
      jvm2 = c(11.47, 11.7, 11.91, 12.13, 12.35, 12.57),
      jvm3 = c(75.65,
               76.88, 56.93, 58.99, 65.29, 67.97),
      jvm4 = c(39.43, 40.86,
               42.27, 43.71, 45.09, 45.33),
      jvm5 = c(27.42, 29.63, 31.02,
               32.37, 33.72, 37.71)
    ),
    .Names = c(
      "Server",
      "Date",
      "Host_CPU",
      "UsedMemPercent",
      "jvm1",
      "jvm2",
      "jvm3",
      "jvm4",
      "jvm5"
    ),
    class = "data.frame",
    row.names = c(NA,-6L)
  )

df[,select] should be what you're looking for.

@user1471980 This answer works perfectly fine if you create `select` obviously. But you didn't specify you wanted to keep a few other ones as well. — Joris Meys, Jul 14 '17 at 19:05
@user1471980 Yeah I misunderstood your question, looks like you need: `cbind(df[,1:4], df[,select])` — Alex Braksator, Jul 14 '17 at 19:09

score 1 · Answer 3 · answered Jul 14 '17 at 18:13

Here's one way:

df[,c(1:4,7:9)]
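Hard-coded positions break if the column order changes. As a sketch, the same positions can be derived from names with match(). The one-row `df` below is a toy with the question's column names and placeholder values, and `cols`, `keep` and `pos` are names assumed for illustration.

```r
# Toy one-row data frame with the question's column names (placeholder values)
cols <- c("Server", "Date", "Host_CPU", "UsedMemPercent",
          "jvm1", "jvm2", "jvm3", "jvm4", "jvm5")
df <- as.data.frame(setNames(as.list(seq_along(cols)), cols))

keep <- c("Server", "Date", "Host_CPU", "UsedMemPercent", "jvm3", "jvm4", "jvm5")

# Positions of the wanted columns, looked up by name
pos <- match(keep, names(df))
pos

# Same result as the hard-coded positions and as indexing by name
identical(df[, pos], df[, c(1:4, 7:9)])
identical(df[, pos], df[keep])
```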

You can also use dplyr to select columns:

library(dplyr)  # provides select()
select(df, Server, Date, Host_CPU, UsedMemPercent, jvm3, jvm4, jvm5)
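If the jvm column names are held in a character vector, as in the question, a possible dplyr variant is to pass that vector through all_of(). This sketch assumes dplyr 1.0.0 or later is installed. The toy `df` has placeholder values, and the vector is named `jvm_cols` here (a hypothetical name) rather than `select`, to avoid confusion with the select() function.

```r
library(dplyr)

# Toy one-row data frame with the question's column names (placeholder values)
cols <- c("Server", "Date", "Host_CPU", "UsedMemPercent",
          "jvm1", "jvm2", "jvm3", "jvm4", "jvm5")
df <- as.data.frame(setNames(as.list(seq_along(cols)), cols))

jvm_cols <- c("jvm3", "jvm4", "jvm5")  # the question calls this vector `select`

# Keep the first four columns by range, plus the columns named in jvm_cols
result <- df %>% select(Server:UsedMemPercent, all_of(jvm_cols))
names(result)
```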

Mako212
