5

I need help with something that might be fairly simple in R. I want to refer to a range of columns in a data frame (e.g., extracting a few select variables). However, I don't know their column numbers. Normally, if I wanted to extract columns 4-10 i would say mydata[,4:10].

However, given that I don't know the column numbers, I would want to refer to them by name. Is there an easy way to do this? in sas or spss it is fairly easy to refer to a range of variables by name. Alternatively, is there an easy way to figure out which column number corresponds to a variable name in R?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
tcarpenter
  • 265
  • 1
  • 2
  • 11

6 Answers6

7

Getting a range of columns can be done in several ways. subset(data.frame, select = name4:name10), works but is quite long. I used that before I got annoyed writing long commands for a simple thing. I made a function to tackle the naming columns / not remembering column numbers in large data frames:

coln <- function(X){
  y <- rbind(seq(1,ncol(X)))
  colnames(y) <- colnames(X)
rownames(y) <- "col.number"
  return(y)} 

Here is how it works:

df <- data.frame(a = 1:10, b =10:1, c = 1:10)
coln(df)
           a b c
col.number 1 2 3

Now you can call them with numbers and still look at names.

Mikko
  • 7,530
  • 8
  • 55
  • 92
  • I like the strategy here. I will use this in combination with some other code. – tcarpenter Dec 05 '13 at 05:03
  • If you think this is the right answer you should select it using the tick mark to the left of the text. That gets points for the user who answered. This is how Stack Overflow works. – SlowLearner Dec 10 '13 at 11:19
  • Upvoted for the mention of `subset()`. It was exactly what I was looking for. – jorvaor Dec 01 '22 at 12:15
6

A column number can be identified from a column name within a data frame as follows:

which(colnames(mydf)=="a")

where mydf is a data frame and a is the name of the column the column number is required for.

(Source)

This can be used to create a column range:

firstcol = which(colnames(x)=="a")
lastcol = which(colnames(x)=="b")

mydf[c(firstcol:lastcol)]
Stephen Rauch
  • 47,830
  • 31
  • 106
  • 135
Matt_J
  • 76
  • 1
  • 3
2

Use %in% in combination with names(). It's useful for grabbing a group of columns from a data frame. You can negate the expression when you want to keep just a subset and drop the rest. Type ?"%in%" at the R Console prompt for more details.

set.seed(1234)
mydf <- data.frame(A = runif(5, 1, 2),
                   B = runif(5, 3, 4),
                   C = runif(5, 5, 6),
                   D = runif(5, 7, 8),
                   E = runif(5, 9, 10))
mydf

keep.cols <- c('A','D','E')
mydf[, names(mydf) %in% keep.cols]
drop.cols <- c('A','B','C')
mydf[, !names(mydf) %in% drop.cols]

The data frame:

> mydf
         A        B        C        D        E
1 1.113703 3.640311 5.693591 7.837296 9.316612
2 1.622299 3.009496 5.544975 7.286223 9.302693
3 1.609275 3.232551 5.282734 7.266821 9.159046
4 1.623379 3.666084 5.923433 7.186723 9.039996
5 1.860915 3.514251 5.292316 7.232226 9.218800

A subset of columns:

> mydf[, names(mydf) %in% keep.cols]
         A        D        E
1 1.113703 7.837296 9.316612
2 1.622299 7.286223 9.302693
3 1.609275 7.266821 9.159046
4 1.623379 7.186723 9.039996
5 1.860915 7.232226 9.218800

Keeping a subset of columns and dropping the rest:

> mydf[, !names(mydf) %in% drop.cols]
         D        E
1 7.837296 9.316612
2 7.286223 9.302693
3 7.266821 9.159046
4 7.186723 9.039996
5 7.232226 9.218800
SlowLearner
  • 7,907
  • 11
  • 49
  • 80
1

I think I figured it out, but it's a bit ornery. Here's an example using mtcars to get the columns between hp and vs. do.call usually means there is a simpler way, though.

mtcars[do.call(seq, as.list(match(c("hp", "vs"), colnames(mtcars))))]
Neal Fultz
  • 9,282
  • 1
  • 39
  • 60
0

Here is a fun little function that combines the ideas behind Largh's answer with a handy function call. To use it, just enter

call.cols(mydata, "firstvarname", "lastvarname")

call.cols <- function(df, startvar, endvar) {
  col.num <- function(df){
    var.nums <- seq(1,ncol(df))
    names(var.nums) <- colnames(df)      
    return(var.nums)
  } 

 start.num <- as.numeric(col.num(df)[startvar])
 end.num <- as.numeric(col.num(df)[endvar])
 range.num <- start.num:end.num
 return(df[range.num]) 
}

I plan to expand this to use for scale creation for psychometric research.

tcarpenter
  • 265
  • 1
  • 2
  • 11
  • 1
    The downside with your function is that you cannot subset columns outside of your range. With `subset` you can. For example: `subset(swiss, select = c(Fertility, Education:Infant.Mortality))`. Similarly you could use `coln`to get the column numbers: `swiss[c(1,4:6)]`. – Mikko Dec 05 '13 at 09:11
  • Ahh. I see the utility. Helpful! – tcarpenter Dec 06 '13 at 17:27
0

You can call the column numbers by their names:

set.seed(1234)
> mydf <- data.frame(A = runif(5, 1, 2),
                     + B = runif(5, 3, 4),
                     + C = runif(5, 5, 6),
                     + D = runif(5, 7, 8),
                     + E = runif(5, 9, 10))
> mydf
mydf[c(match("A", names(mydf)):match("B", names(mydf)))]
         A        B
1 1.113703 3.640311
2 1.622299 3.009496
3 1.609275 3.232551
4 1.623379 3.666084
5 1.860915 3.514251

Here you can see that the match()-call actually gives the column number:

> c(match("A", names(mydf)):match("B", names(mydf)))
[1] 1 2

I hope this is also helpful, it is similar to Neal's answer.

AlexWatts
  • 3
  • 2