4

Here is how I run some SQL queries with dbplyr:

library(tidyverse)
library(dbplyr)
library(DBI)
library(RPostgres)
library(bit64)
library(tidyr)

drv <- RPostgres::Postgres()

con <- dbConnect(drv, dbname = 'mydb', port = 5432, user = 'postgres')

table1 <- tbl(con,'table1')
table2 <- tbl(con,'table2')
table3 <- tbl(con,'table3')

table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))

I want to drop the rows that contain NA and then collect my final table, but I couldn't find anything that works with dbplyr queries.

I tried piping drop_na() from tidyr and some base functions (complete.cases(), etc.) with no luck. Any suggestion to achieve this is welcome; piping an SQL clause (like WHERE FOO IS NOT NULL) into the dbplyr query would also be fine.

Thanks in advance.

Samet Sökel
  • What about `na.omit(table)`? – PaulG Mar 24 '21 at 09:12
  • @PaulG Hi, what does `table` refer to? if you suggest me to pipe `na.omit()` at the end of my dbplyr query, I tried this and it is one of the disappointments. I wanna omit NA's before collecting because it creates a huge size of temporary file in my disk. – Samet Sökel Mar 24 '21 at 09:19
  • `table` should be a dataframe or tibble. What class is your resulting table? Maybe try converting it first. – PaulG Mar 24 '21 at 10:16
  • @PaulG well, actually it is not any of what you said yet. it was a lazy query classed 'tbl_lazy', 'tbl_sql' and 'tbl'. the problem begins there, I need to do whatever I gotta do before make it a dataframe or tibble etc. – Samet Sökel Mar 24 '21 at 10:20
  • The SQL query [in this answer](https://dba.stackexchange.com/questions/143959/delete-from-table-rows-where-any-of-the-column-field-is-null) seems to be doing what you need. – PaulG Mar 24 '21 at 14:40

2 Answers

3

Try using !is.na(col_name) as part of a filter:

library(dplyr)
library(dbplyr)

df <- data.frame(my_num = c(1, 2, 3))
df <- tbl_lazy(df, con = simulate_mssql())

output <- df %>% filter(!is.na(my_num))

Calling show_query(output) to check the generated SQL gives:

<SQL>
SELECT *
FROM `df`
WHERE (NOT(((`my_num`) IS NULL)))

The extra brackets are part of how dbplyr does its translation.
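Because the filter becomes part of the generated SQL, the NA rows are removed in the database before collect() transfers anything, which addresses the temporary-file concern from the comments. A minimal round trip, using dbplyr's memdb_frame() to stand in for a real Postgres connection (assumes the RSQLite package is installed):

```r
library(dplyr)
library(dbplyr)

# memdb_frame() copies a small data frame into an in-memory SQLite
# database and returns a lazy tbl, so the full pipeline can be run
# without a live Postgres server
df <- memdb_frame(id = 1:3, year = c(2019L, NA, 2021L))

result <- df %>%
  filter(!is.na(year)) %>%  # translated to WHERE (NOT((`year` IS NULL)))
  collect()

nrow(result)  # 2: the NA row never leaves the database
```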

If you want to do this for multiple columns, try the following approach based on this answer:

library(rlang)
library(dplyr)
library(dbplyr)

df <- data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7))
df <- tbl_lazy(df, con = simulate_mssql())

cols <- c("c1", "c2")
conditions <- paste0("!is.na(", cols, ")")

output <- df %>%
  filter(!!!parse_exprs(conditions))

Calling show_query(output) shows both columns appear in the generated query:

<SQL>
SELECT *
FROM `df`
WHERE ((NOT(((`c1`) IS NULL))) AND (NOT(((`c2`) IS NULL))))
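As an alternative to building expression strings with rlang, newer dbplyr versions (roughly 2.1 onward; check your installed version) can also translate dplyr's if_all() helper, which applies a predicate across every selected column. A sketch:

```r
library(dplyr)
library(dbplyr)

df <- tbl_lazy(data.frame(c1 = c(1, 2, 3), c2 = c(9, 8, 7)),
               con = simulate_mssql())

# if_all(everything(), ...) applies the predicate to every column and
# joins the per-column conditions with AND in the WHERE clause
query <- df %>%
  filter(if_all(everything(), ~ !is.na(.x)))

show_query(query)
```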
Simon.S.A.
0

Well, I still haven't found a fully satisfying solution. What I wanted was to drop rows containing NA from within R without writing an SQL query by hand, and I don't think dbplyr supports that yet.

So I wrote a small and simple workaround:

main_query <- table1 %>%
  mutate(year = as.integer64(year)) %>%
  left_join(table2, by = c('id' = 'id')) %>%
  left_join(table3, by = c('year' = 'year'))

# column names of the lazy query (retrieved without collecting rows)
cols <- main_query %>% colnames()

# render the lazy query to SQL and start a WHERE clause
# (this assumes the rendered query does not already end in a
# WHERE or ORDER BY clause)
query1 <- main_query %>% sql_render() %>% paste('WHERE')

query2 <- ''

# build "col1 IS NOT NULL AND col2 IS NOT NULL AND ..." over all columns
for (i in cols) {

    if (i == tail(cols, 1)) { query2 <- paste(query2, i, 'IS NOT NULL') }

    else { query2 <- paste(query2, i, 'IS NOT NULL AND') }

}

# run the final query; the NA rows are filtered out in the database
desiredTable <- dbGetQuery(con, paste(query1, query2))

Yeah, I know it isn't magical, but maybe someone can make use of it.
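For what it's worth, the column loop above can be collapsed into a single paste() call with collapse = ' AND ', which produces the same clause (the column names below are only illustrative stand-ins for colnames(main_query)):

```r
# illustrative column names standing in for colnames(main_query)
cols <- c("id", "year", "value")

# paste() vectorises over cols ("id IS NOT NULL", ...), then collapse
# joins the pieces with " AND " into one WHERE-clause body
not_null <- paste(cols, "IS NOT NULL", collapse = " AND ")

not_null
# "id IS NOT NULL AND year IS NOT NULL AND value IS NOT NULL"
```

The full query would then be paste(sql_render(main_query), "WHERE", not_null), passed to dbGetQuery() as before.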

Samet Sökel