2

I have a dataframe where the columns are alphabetically ordered (COL_A, COL_B, COL_C, etc.). How can I subset out columns H to M, without writing out the column names explicitly, or without counting how many columns there are?

Edit for clarification: I don't mean to say that the columns are exactly COL_A, COL_B and so on, merely that they are alphabetically arranged with any common structure in the name-strings. They can be, for example, Alabama, Arkansas, Texas, Wyoming, and Zambia. In an nutshell, I am trying to find an alternative to df_subset = df[,n1:n2] where I can directly put in the column names instead of having to count out the column numbers n1 and n2.

user702432
  • 11,898
  • 21
  • 55
  • 70

3 Answers3

3
d = as.data.frame(matrix(1:26,nrow=1))
names(d) = paste("COL_", LETTERS, sep="")
grep("[H-M]$", names(d))
d[, grepl("[H-M]$", names(d))]
baptiste
  • 75,767
  • 19
  • 198
  • 294
3

A more general approach is to use the >= and <= operators applied to the column names. Here is an example using a data frame where the column names are US states:

> df <- data.frame(as.list(state.abb))
> colnames(df) <- state.name
> df[, 1:3]
  Alabama Alaska Arizona
1      AL     AK      AZ
> df[colnames(df) >= "Florida" & colnames(df) <= "Illinois"]
  Florida Georgia Hawaii Idaho Illinois
1      FL      GA     HI    ID       IL

Another method would be to find the indices of the two boundaries using match and build a sequence between those two:

> df[seq(from = match("Florida", colnames(df)),
+        to   = match("Illinois", colnames(df)))]
  Florida Georgia Hawaii Idaho Illinois
1      FL      GA     HI    ID       IL
flodel
  • 87,577
  • 21
  • 185
  • 223
  • I was actually trying to find a alphabet/string version of df[,n1:n2], i.e. I wanted to avoid counting the actual column numbers. Your second option is perfect. Many thanks, flodel. – user702432 Mar 07 '12 at 03:29
1

You can try something like this

dfrm <- data.frame(replicate(26, rnorm(10)))
colnames(dfrm) <- paste("COL", LETTERS, sep="_")
which(substr(colnames(dfrm), 5, 6) %in% LETTERS[3:6])

The last expression returns column number that match the letters C to F. See also match, and this related thread: Get column index from label in a data frame.

Community
  • 1
  • 1
chl
  • 27,771
  • 5
  • 51
  • 71