40

Not so much 'How do you...?' but more 'How do YOU...?'

If you have a file someone gives you with 200 columns, and you want to reduce it to the few ones you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assuming we have a data frame with columns col1, col2 through col200. If you only wanted 1-100 and then 125-135 and 150-200, you could:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("dat101","dat102",...)]

Anything else I'm missing? I know this is sightly subjective but it's one of those nitty gritty things where you might dive in and start doing it one way and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? name(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print out the names in that way you have spaces everywhere and have to manually put in commas... Is there a command that will give you "col1","col2","col3",... as your output so you can easily grab what you want?

Community
  • 1
  • 1
nzcoops
  • 9,132
  • 8
  • 41
  • 52
  • 8
    6000th [r] question. – Iterator Aug 16 '11 at 00:12
  • By the way, your question's title is "rows" from a data.frame, but it asks about columns from a file. An edit of the title may be a good idea. :) – Iterator Aug 16 '11 at 00:19
  • 2
    Related: [Remove an entire column from a data.frame in R](http://stackoverflow.com/questions/6286313/remove-an-entire-column-from-a-data-frame-in-r) – Joshua Ulrich Aug 16 '11 at 01:21
  • 1
    Please be careful with list-like questions; they tend to elicit Answers that are i) just list options, and ii) opinions, and are difficult to Answer correctly - there is no accepted Answer for "What do YOU use?"; every Answer is correct. – Gavin Simpson Aug 16 '11 at 08:49
  • @nzcoops About your edit subquestion: `cat(shQuote(names(iris)), sep=", ")`, so wrap it into function and use it ;) – Marek Aug 16 '11 at 10:20
  • Or `writeLines(strwrap(paste(shQuote(names(iris)),collapse=", "), width = 60))` – Marek Aug 16 '11 at 10:24
  • This should be CW b/c of question type and impossibility of single best answer. – Iterator Aug 16 '11 at 11:16
  • @Marek, thanks, nice solution, I knew there was a concatenate but couldn't remember what it was. – nzcoops Aug 16 '11 at 23:52
  • Re CW. I agree, where is the CW? I've seen some questions tagged as CW but I can't see anywhere how to do that. People often mention questions should be CW but no one (I'm yet to see) actually links to it. I do like discussions like this to know I'm not doing something inefficiently over and over, the nitty gritty is often overlooked. Like @MDowies comment about the whole table being copied. – nzcoops Aug 16 '11 at 23:55

12 Answers12

57

I use data.table's := operator to delete columns instantly regardless of the size of the table.

DT[, coltodelete := NULL]

or

DT[, c("col1","col20") := NULL]

or

DT[, (125:135) := NULL]

or

DT[, (variableHoldingNamesOrNumbers) := NULL]

Any solution using <- or subset will copy the whole table. data.table's := operator merely modifies the internal vector of pointers to the columns, in place. That operation is therefore (almost) instant.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 2
    for data frames I get this error: `Error: could not find function ":="`. So I guess this post is outdated. – Pio May 04 '14 at 08:56
  • 4
    @Pio, I hope you kidding right? This works only for `data.table` class objects, i.e, if your data frame is `df` you do `library(data.table); setDT(df)[,c("col1","col20"):=NULL]` etc... – David Arenburg Jul 14 '14 at 22:21
32

To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but less than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.

For more than that, I'll use subset, with negative names (!):

subset(mtcars, , -c(mpg, cyl, disp, hp))
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • 5
    I use `dat[c("x","y","z")] <- list(NULL)` to remove columns. And usually divide it into two steps: generate names to remove `to_remove <- c("x","y","z")`, but it could be e.g. `names(dat)[sapply(dat, function(x) all(is.na(x)))]` then `dat[to_remove] <- list(NULL)`. – Marek Aug 16 '11 at 13:50
  • `dat$x <- NULL` ... Warning: Coercing LHS to a list? hmmm that's not good – Brian Jul 04 '13 at 15:15
  • 1
    Weirdly i get this: `Error in -c("V2","V3"): invalid argument to unary operator ` . when i try to deselect variable for the subset – Hercules Apergis May 15 '17 at 14:30
10

For clarity purposes, I often use the select argument in subset. With newer folks, I've learned that keeping the # of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when needing to select data within a given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.

Chase
  • 67,710
  • 18
  • 144
  • 161
7

Use read.table with colClasses instances of "NULL" to avoid creating them in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)
mdsumner
  • 29,099
  • 6
  • 83
  • 91
5

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done about regular expressions.

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than read it at all into RAM. I have done this with C, and it can be blisteringly fast.

Iterator
  • 20,250
  • 12
  • 75
  • 111
  • Your 'blisteringly fast' remark reminded me about `:=` in [data.table](http://datatable.r-forge.r-project.org/), see my answer just added. – Matt Dowle Aug 16 '11 at 11:07
  • Glad you answered! I was looking for a data.table solution. Blisteringly fast always beats blazing fast. ;-) – Iterator Aug 16 '11 at 11:13
3

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
b=rnorm(100),
c=rnorm(100),
d=rnorm(100),
e=rnorm(100),
f=rnorm(100),
g=rnorm(100)) 

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g 

Removing columns "c" and "g"

df[,-c(3,7)]

This is especially useful if you have data.frames that are large or have long column names that you don't want to type. Or column names that follow a pattern, because then you can use seq() to remove.

RE: Your edit

You don't necessarily have to put "" around a string, nor "," to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))
Brandon Bertelsen
  • 43,807
  • 34
  • 160
  • 255
  • Mmm, far point. I had/hinted at that in the question. I don't like this given that if your original data changes you have to rework your script. If someone re-exports the dataset from a program with a new second column, all your references are then out. – nzcoops Aug 16 '11 at 05:59
  • You could always save the names of the removal columns to a vector and then that wouldn't matter,`df[,-c(character_vector)]` – Brandon Bertelsen Aug 16 '11 at 06:05
  • Yup. Have added and 'additional' bit to the question to address that. In hindsight that probably should have been the question. – nzcoops Aug 16 '11 at 06:15
2

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

Thought it was really clever make a list of "not to include"

jesusgarciab
  • 129
  • 11
1

Can use setdiff function:

If there are more columns to keep than to delete: Suppose you want to delete 2 columns say col1, col2 from a data.frame DT; you can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]

If there are more columns to delete than to keep: Suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Bindu
  • 11
  • 2
1

If you have a vector of names already,which there are several ways to create, you can easily use the subset function to keep or drop an object.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a vector of column names which is pre-created. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names
df1 <- as.matrix(as.character(names(df)))

#retaining only the name values you want to keep
KEEP <- as.vector(df1[c(1:3,5,6),])

#subsetting the intial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
        a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733
1

The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)

Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
1

Just addressing the edit.

@nzcoops, you do not need the column names in a comma delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The , just separates arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error prone to select from the elements of names(dat) than to process its output to a comma separated string you can cut and paste from.

Say we want columns col1 and col2, subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always print the vector the screen in quotes ":

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and past from that string. Far better to work with objects that return what you want and use standard subsetting routines to keep what you need.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • 1
    the reason I do this is because I don't like using number references. If someone comes back to you and had forgotten a column so re-exported the data file with a new second column, then you have to go through and make changes to your c(1,3) line. – nzcoops Aug 16 '11 at 23:51
  • @nzcoops I can understand that. However, I've gotten into the habit of checking all my code if the data change so updating the column numbers I want would be but one of the things I checked. I forgot to mention that, if you wrap each of the two `paste()` calls in `writeLines()` R will write the string to the console without the wrapping `"`, which makes the first `paste()` example most closely represent what you asked for. – Gavin Simpson Aug 17 '11 at 08:25
0

rm in within can be quite useful.

within(mtcars, rm(mpg, cyl, disp, hp))
#                     drat    wt  qsec vs am gear carb
# Mazda RX4           3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4
# Datsun 710          3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2
# Valiant             2.76 3.460 20.22  1  0    3    1
# ...

May be combined with other operations.

within(mtcars, {
  mpg2=mpg^2
  cyl2=cyl^2
  rm(mpg, cyl, disp, hp)
  })
#                     drat    wt  qsec vs am gear carb cyl2    mpg2
# Mazda RX4           3.90 2.620 16.46  0  1    4    4   36  441.00
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4   36  441.00
# Datsun 710          3.85 2.320 18.61  1  1    4    1   16  519.84
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1   36  457.96
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2   64  349.69
# Valiant             2.76 3.460 20.22  1  0    3    1   36  327.61
# ...
jay.sf
  • 60,139
  • 8
  • 53
  • 110