44

I have a dataframe where some of the values are NA. I would like to remove these columns.

My data.frame looks like this

    v1   v2 
1    1   NA 
2    1    1 
3    2    2 
4    1    1 
5    2    2 
6    1   NA

I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.

data=subset(Itun, select=c(is.na(colMeans(Itun))))

I got an error,

error : 'x' must be an array of at least two dimensions

Can anyone give me some help?

lmo
  • 37,904
  • 9
  • 56
  • 69
TTT
  • 4,354
  • 13
  • 73
  • 123
  • Please add an example of what you would like to have as a result. It would also be really helpful to have a fully reproducible example. – BenBarnes Sep 17 '12 at 07:43

8 Answers8

71

The data:

Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA)) 

This will remove all columns containing at least one NA:

Itun[ , colSums(is.na(Itun)) == 0]

An alternative way is to use apply:

Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
47

Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.

library(dplyr)
Itun %>%
    select_if(~ !any(is.na(.)))
jiggunjer
  • 1,905
  • 1
  • 19
  • 18
Matt Dancho
  • 6,840
  • 3
  • 35
  • 26
  • I was wondering if you can extract the column names of the removed columns simultaneously. Is this possible? – Kots Dec 12 '17 at 17:10
  • 2
    I'd split that into two operations. Use `Itun %>% select_if(~ any(is.na(.))) %>% names()`. Then remove columns in second operation using code above. – Matt Dancho Dec 13 '17 at 19:55
  • great solution. for the cases that collumns should be remove that only have NAs you can use `select_if(~ !all(is.na(.))` – JdP Feb 23 '18 at 09:29
  • This solution is very nice but very slow. Itun[ , colSums(is.na(Itun)) == 0] by @Sven-hohenstein is much faster. – Matthias Munz Aug 13 '20 at 08:27
  • What does it return though if I wanted to have the columns that have NA/NULL ? When I ran the opposite (i.e. without `!`), it returned a bunch of columns that didn't have NAs; the column that had NA was returned along with them, though. – stucash Nov 15 '21 at 16:59
17

Alternatively, select(where(~FUNCTION)) can be used:

library(dplyr)

(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#>   x  y  z
#> 1 a NA  1
#> 2 b NA  2
#> 3 c NA  3
#> 4 d NA  4
#> 5 e NA NA

# Remove columns where all values are NA
df %>% 
  select(where(~!all(is.na(.))))
#>   x  z
#> 1 a  1
#> 2 b  2
#> 3 c  3
#> 4 d  4
#> 5 e NA
  
# Remove columns with at least one NA  
df %>% 
  select(where(~!any(is.na(.))))
#>   x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
HBat
  • 4,873
  • 4
  • 39
  • 56
13

You can use transpose twice:

newdf <- t(na.omit(t(df)))
Scott Worland
  • 1,352
  • 1
  • 12
  • 15
6
data[,!apply(is.na(data), 2, any)]
Backlin
  • 14,612
  • 2
  • 49
  • 81
  • Shouldn't the `data.frame` version be the same as the `matrix` version, just without the first comma? I get an error (`undefined columns selected`) with your code as it is. – A5C1D2H2I1M1N2O1R2T1 Sep 17 '12 at 07:44
  • 1
    However, `apply` converts the input to a matrix prior to applying the function, so I prefer to use `sapply` or `lapply` on data frames. Then again so does `is.na` so in this case the input is already a matrix and my first example was actually incorrect! Perhaps the conceptually nices solution is `sapply(data, function(x) !any(is.na(x)))`, but this is really nitpicking. – Backlin Sep 17 '12 at 08:05
2

A base R method related to the apply answers is

Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
  v1
1  1
2  1
3  2
4  1
5  2
6  1

Here, vapply is used as we are operating on a list, and, apply, it does not coerce the object into a matrix. Also, since we know that the output will be logical vector of length 1, we can feed this to vapply and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).

lmo
  • 37,904
  • 9
  • 56
  • 69
2

Another alternative with the dplyr package would be to make use of the Filter function

Filter(function(x) !any(is.na(x)), Itun)

with data.table would be a little more cumbersome

setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
                                which(colSums(is.na(Itun))>0))]
Oriol Prat
  • 1,017
  • 1
  • 11
  • 19
0

You can also try:

df <- df[,colSums(is.na(df))<nrow(df)]