0

I am attempting to perform an unconventional filter of a dataframe. I am looking to filter the columns of a dataframe that have a particular value in a row

To do this the other way around is quite easy using filter in dplyr but I can't seem to figure it out filtering by column instead of the conventional filtering by row.

I have tried using the dplyr filter command but I suspect the command is not designed to perform this function.

filter(data,rowname >5)

This gives me an error

object 'rowname' not found

For example:

a<-c(1,0,NA,0,1)
b<-c(0,1,1,1,0)
c<-c(1,0,1,0,1)
d<-c(0,NA,1,NA,2)
df<-data.frame(a,b,c,d)
rownames(df) <- c(1:4,"sum_NA")
#filter/select by column
select(df, sum_NA>0)

Desired output:

        a  d
1       1  0
2       0 NA
3      NA  1
4       0 NA
sum_NA  1  2

If anyone has a workaround or a function designed for this, that would be fantastic.

dev
  • 25
  • 5
  • 2
    Please make a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and indicate what your desired output is. `filter()` works on rows the analog for columns is `select()`. – DaveArmstrong Aug 11 '23 at 00:23
  • @DaveArmstrong, Thanks! I have updated the question with an example and desired output – dev Aug 11 '23 at 00:59
  • 1
    Base R solution: `df[apply(df, 2, anyNA)]`. – neilfws Aug 11 '23 at 01:08
  • https://www.statisticalpoint.com/remove-columns-with-na-in-r/#:~:text=You%20can%20use%20one%20of%20the%20following%20two,%21any%20%28is.na%28.%29%29%29%20Both%20methods%20produce%20the%20same%20result. – M-- Aug 11 '23 at 04:25
  • This is a dupe because we already have questions addressing how to drop columns with NA. You are just asking the opposite of it, which can be achieved by dropping the `!` from the solutions in the dupe-target. – M-- Aug 11 '23 at 04:28

1 Answers1

1

I think you want select(where()) from dplyr. The input to where() is a function that returns a single TRUE or FALSE for each column. In your case, you want to keep any variable where one of the observations is missing. You could do that with the following:

library(dplyr)
a<-c(1,0,NA,0,1)
b<-c(0,1,1,1,0)
c<-c(1,0,1,0,1)
d<-c(0,NA,1,NA,2)
df<-data.frame(a,b,c,d)
rownames(df) <- c(1:4,"sum_NA")
select(df, where(function(x)any(is.na(x))))
#>         a  d
#> 1       1  0
#> 2       0 NA
#> 3      NA  1
#> 4       0 NA
#> sum_NA  1  2

Here are a few other examples that may be useful. First, what if you wanted only variables where there are at least two missing values, you could use function(x)sum(is.na(x)) >= 2

select(df, where(function(x)sum(is.na(x)) >= 2))
#>         d
#> 1       0
#> 2      NA
#> 3       1
#> 4      NA
#> sum_NA  2

If you wanted variables that have no missing values, you could use function(x)!any(is.na(x))

select(df, where(function(x)!any(is.na(x))))
#>        b c
#> 1      0 1
#> 2      1 0
#> 3      1 1
#> 4      1 0
#> sum_NA 0 1

If you wanted to keep all variables that contained a value greater than 1, you could use function(x)any(na.omit(x) > 1)

select(df, where(function(x)any(na.omit(x) > 1)))
#>         d
#> 1       0
#> 2      NA
#> 3       1
#> 4      NA
#> sum_NA  2

Created on 2023-08-11 with reprex v2.0.2

DaveArmstrong
  • 18,377
  • 2
  • 13
  • 25
  • 2
    `df %>% select(where(anyNA))` also works. – neilfws Aug 11 '23 at 01:10
  • 1
    @neilfws indeed, it does, thanks for mentioning it. The solution where you write a function is more general, though. I included a few other examples that highlights this approach's versatility. – DaveArmstrong Aug 11 '23 at 01:17