0

I have a data.table. I want to remove those rows where all columns except certain 2 columns are NA. For example:

I have a data.table like:

> ww2
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
 1:          5.1         3.5          1.4         0.2  setosa     1
 2:          4.9         3.0          1.4         0.2  setosa     2
 3:          4.7         3.2          1.3         0.2  setosa     3
 4:          4.6         3.1          1.5         0.2  setosa     4
 5:          5.0         3.6          1.4         0.2  setosa     5
 6:          5.1         3.5          1.4         0.2 dffdsdf     1
 7:          4.9         3.0          1.4         0.2 dffdsdf     2
 8:          4.7         3.2          1.3         0.2 dffdsdf     3
 9:           NA          NA           NA          NA dffdsdf     4
10:           NA          NA           NA          NA dffdsdf     5

Its dput is:

    structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.1, 4.9, 
4.7, NA, NA), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.5, 3, 
3.2, NA, NA), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.4, 
1.4, 1.3, NA, NA), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 
0.2, 0.2, NA, NA), Species = structure(c(1L, 1L, 1L, 1L, 1L, 
4L, 4L, 4L, 4L, 4L), class = "factor", .Label = c("setosa", "versicolor", 
"virginica", "dffdsdf")), index = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 
3L, 4L, 5L)), .Names = c("Sepal.Length", "Sepal.Width", "Petal.Length", 
"Petal.Width", "Species", "index"), row.names = c(NA, -10L), class = "data.frame")

In above data table I want to remove row number 9 and 10. Since my actual data table is really big and has a lot more columns, it is difficult to explicitly mention those columns which are NA. But the columns which are not NA are fixed (they are 2, and in this particular example they are index and Species.

I am looking for an efficient and fast solution to this.

user3664020
  • 2,980
  • 6
  • 24
  • 45
  • @Pascal it is not. There the person wanted remove those rows where any column is NA (does not matter if 1 column or 2 columns or 3 columns are NA). But here, I want to remove those rows where a fixed number of columns are NAs. – user3664020 Nov 09 '15 at 11:24
  • The `dput` gives an error Error: unexpected '<' in " " – Ronak Shah Nov 09 '15 at 11:25
  • So just specify them in `.SDcols`? For example `ww2[!ww2[, Reduce('&', lapply(.SD, is.na)), .SDcols = -(Species:index)]]` – David Arenburg Nov 09 '15 at 11:26
  • @user3664020 I am pretty sure the answer in the duplicate can be adapted. –  Nov 09 '15 at 11:26
  • @RonakShah i have edited the `dput` , please convert that to a `data.table`. using `dt <- data.table(df)` – user3664020 Nov 09 '15 at 11:29
  • @DavidArenburg `.SDcols` takes columns in what format? What if I want to pass 3 columns. `.SDcols = -c(Species, index, col3)` is not working. – user3664020 Nov 09 '15 at 11:33
  • Try `-c("Species", "index", "Sepal.Length")` Or if they are in order you can just do `-(Petal.Width:index)` for exmaple – David Arenburg Nov 09 '15 at 11:34
  • If this works for you we could probably close this as a dupe. – David Arenburg Nov 09 '15 at 11:54

1 Answers1

4

Given the data you provided, I would do something like:

library(dplyr)
na_rows = ww2 %>% 
            select(-Species, -index) %>% 
            is.na() %>% 
            rowSums() > 0

ww2 %>% 
  filter(!na_rows)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species index
1          5.1         3.5          1.4         0.2  setosa     1
2          4.9         3.0          1.4         0.2  setosa     2
3          4.7         3.2          1.3         0.2  setosa     3
4          4.6         3.1          1.5         0.2  setosa     4
5          5.0         3.6          1.4         0.2  setosa     5
6          5.1         3.5          1.4         0.2 dffdsdf     1
7          4.9         3.0          1.4         0.2 dffdsdf     2
8          4.7         3.2          1.3         0.2 dffdsdf     3

or more default R style (I like dplyr):

na_rows = rowSums(is.na(ww2[, .SD, .SDcols = !c('Species', 'index')]), with = FALSE])) > 0
ww2[!na_rows,]
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149