2

I want to delete all columns of a dataframe which contain NA values only, within the dbplyr syntax.
With the normal R syntax this is no problem, see here: Remove columns from dataframe where ALL values are NA
Here is a possibility in R:

library(tidyverse)
library(DBI)
library(dbplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
# desired output:
# A tibble: 5 × 2
      b     c
  <int> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1
5     5    NA

But now I want to do this in the dbplyr syntax. For this the following example:

# Creating test database
con <- dbConnect(RSQLite::SQLite(), "")
# Inserting test table
dbWriteTable(con, "df", df)

con %>% 
  tbl("df") %>% 
  purrr::discard(~all(is.na(.))) %>% 
  collect()

#Error in UseMethod("collect") : 
# no applicable method for 'collect' applied to an object of class "list"

Can anyone help me find a solution within dbplyr?

Phil
  • 7,287
  • 3
  • 36
  • 66
TobKel
  • 1,293
  • 8
  • 20

1 Answers1

3

Revised answer - remove columns where every value is NA

The idea here is check each column for the number of NAs and compare against the number of rows:

df = con %>% tbl("df")

num_rows = df %>%
  ungroup() %>%
  summarise(num = n()) %>%
  collect() %>%
  pull(num)

cols_to_remove = c()

for(col in colnames(df)){
  num_na = df %>%
    ungroup() %>%
    mutate(is_na = ifelse(is.na(!!sym(col)), 1, 0)) %>%
    summarise(num_na = sum(is_na)) %>%
    collect() %>%
    pull(num_na)

  if(num_na == num_rows){
    cols_to_remove = c(cols_to_remove, col)
  }
}

local_df = df %>%
  select(-all_of(cols_to_remove)) %>%
  collect()

Original answer - removes rows where every value is NA

Here is one approach:

df = con %>% tbl("df")

df = df %>% mutate(all_na = TRUE)

for(col in colnames(df)){
  df = df %>%
    mutate(all_na = ifelse(!is.na(!!sym(col)), FALSE, all_na))
}

df %>%
  filter(all_na != TRUE) %>%
  select(-all_na)

The idea is to create a new column to store whether NAs are found in the row. By iterating through all the rows and updating this column, we create an indicator column for the presence of all NA values.


Syntax explanation

!!sym(.) is an rlang feature. It converts from the variable col containing the string "my_column" to the column name my_column.

Instead of using !!sym(.) inside select we use all_of for a similar purpose.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • The code does not work for me. The column "a" of the example dataset is not filtered out. – TobKel Jul 05 '23 at 12:57
  • 1
    Sorry, I misread your original question and was filtering out rows. New answer added to filter out columns. – Simon.S.A. Jul 05 '23 at 20:44
  • Now the answer matches my intention, but the `select` command is placed at the end. I want it before the `collect` command. – TobKel Jul 06 '23 at 12:24
  • Edited to add a final collect to the answer. Note that the first two `collect`s (followed by `pull`) are just bringing the row counts into R so they can be used in the `if` condition. – Simon.S.A. Jul 06 '23 at 20:29