Subset columns using logical vector

Question

I have a data frame from which I want to drop every column whose NA rate is above 70%, or in which a single dominant value takes up more than 99% of the rows. How can I do that in R?

I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:

isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)

How can I then use the resulting vector to subset the data frame?

Does this help? http://stackoverflow.com/questions/7381455/filtering-a-data-frame-by-values-in-a-column — user1477388, Jun 12 '14 at 18:49
Also, `sapply(dataframe, isNARateLt70)` is better than `apply` in this case so you don't have to convert to matrix first. — MrFlick, Jun 12 '14 at 19:01
Hi, Take a bit of time and read the tag excerpt before tagging. [tag:dataframes] is for pandas, whereas you need [tag:data.frame] here. Be careful the next time. See this meta post. [Warn \[r\] users from adding \[dataframes\] tag instead of \[data.frame\] tag](http://meta.stackoverflow.com/q/318933) — Bhargav Rao, Mar 14 '16 at 14:02

score 3 · Answer 1 · answered Jun 12 '14 at 18:59

If you have a data.frame like

dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))

#    a b  c  d  e  f  g
# 1 11 2  5  9  7  6 10
# 2 10 5 11 13 11 11  8
# 3 14 8  6 16  9 11  9
# 4 11 8 12  8 11  6 10

You can subset columns with a logical vector using either of

mycols <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
dd[mycols]
dd[, mycols]
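
The two forms differ in one case worth knowing about: when the logical vector selects exactly one column. A minimal sketch, reusing the `dd` construction from above (the seed and the name `one_col` are my own additions for illustration):

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
dd <- data.frame(matrix(rpois(7 * 4, 10), ncol = 7,
                        dimnames = list(NULL, letters[1:7])))

# Logical vector that keeps only the first column
one_col <- c(TRUE, rep(FALSE, 6))

class(dd[one_col])                  # list-style indexing keeps a data.frame
class(dd[, one_col])                # matrix-style indexing drops to a plain vector
class(dd[, one_col, drop = FALSE])  # drop = FALSE keeps a data.frame
```

So if the logical vector is computed and might select a single column, `dd[mycols]` or `dd[, mycols, drop = FALSE]` is probably the safer choice.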

Rich Scriven · Accepted Answer · 2014-06-14T18:03:44.443

0

There's really no need to write a function when we have colMeans(), as shown at the bottom of this answer (thanks to @MrFlick for the advice to switch from colSums()/nrow()).

Here's how I would write your function if you want to pass it to sapply() later.

> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                  z = c(rep(NA, 3), 1, 2))

> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
#     x     y     z 
# FALSE  TRUE  TRUE

Then, to subset your data using the result of the line above, it's

> d[sapply(d, isNARateLt70)]

But as mentioned, colMeans works just the same,

> d[colMeans(is.na(d)) <= 0.7]
#    y  z
# 1  1 NA
# 2 NA NA
# 3 NA NA
# 4  1  1
# 5  1  2
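
The question also asks about dropping columns where one value takes up more than 99% of the rows, which the NA-rate check alone doesn't cover. Here is a rough sketch of how both conditions might be combined. It assumes "dominant" means the most frequent non-NA value's share of all rows; the helper name `isDominated`, its `threshold` argument, and the example data `d2` are hypothetical, not something from the question.

```r
# Hypothetical helper: TRUE if the most frequent non-NA value fills
# more than `threshold` of all rows (NA rows count in the denominator).
isDominated <- function(x, threshold = 0.99) {
  counts <- table(x)                      # table() ignores NA by default
  if (length(counts) == 0) return(FALSE)  # all-NA (or empty) column
  max(counts) / length(x) > threshold
}

# Made-up data for illustration
d2 <- data.frame(
  const  = c(rep(1, 199), 2),      # value 1 fills 99.5% of rows
  id     = seq_len(200),           # every value distinct
  sparse = c(rep(NA, 150), 1:50)   # 75% NA
)

# Keep columns that pass both checks
keep <- colMeans(is.na(d2)) <= 0.7 & !sapply(d2, isDominated)
names(d2)[keep]
d2_clean <- d2[keep]
```

Only `id` should survive here: `const` is dominated by a single value and `sparse` is above the 70% NA limit. If NA rows should not count towards the denominator, replace `length(x)` with `sum(!is.na(x))`.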

edited Jun 14 '14 at 18:03

answered Jun 12 '14 at 19:01


or even `d[colMeans(is.na(d)) <= 0.7]` – MrFlick Jun 12 '14 at 19:02
Right you are. I suppose `<=` would be better. Haha, I just got out of a final exam. Brain's fried. – Rich Scriven Jun 12 '14 at 19:03

score 0 · Answer 3 · answered Jun 12 '14 at 19:01

Maybe this will help too. The second argument to apply(), 2, means the function is applied column-wise to the data.frame cars.

> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed  dist
 TRUE  TRUE
> cars[1:10, columns]
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
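
One caveat with apply(), echoing MrFlick's comment on the question: apply() converts the data frame to a matrix first, so with mixed column types every column is coerced to character. That is harmless for `cars`, where both columns are numeric, but it can break numeric functions elsewhere. A small sketch with made-up data (the data frame `df` is only for illustration):

```r
# Made-up data frame with one numeric and one character column
df <- data.frame(n = c(1, NA, 3), s = c("a", "b", NA),
                 stringsAsFactors = FALSE)

apply(df, 2, class)   # every column has been coerced to character
sapply(df, class)     # each column keeps its own type

# sapply() works column by column, so numeric checks behave as expected
numeric_cols <- sapply(df, is.numeric)
df[numeric_cols]
```

With sapply() the result is still a named logical vector, so it can be used for column subsetting in exactly the same way.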
