3

I have a 50GB SQLite database file and I want to calculate and add new variables. Can you leverage Moody_Mudskipper's function or something using ALTER TABLE and UPDATE to create the variable instead of an entire table?

library(dbplyr)
    library(DBI)
    con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
    copy_to(con, head(iris,3),"iris")

create <- function(data,name){
   DBI::dbSendQuery(data$src$con,
                    paste("CREATE TABLE", name,"AS", dbplyr::sql_render(data)))
                             }

tbl(con,"iris") %>% 
   mutate(Sepal.Area= Sepal.Length * Sepal.Width) %>% 
   create("iris_2")
Geet
  • 2,515
  • 2
  • 19
  • 42
  • 1
    I hadn't seen this but meanwhile updated my answer there: https://stackoverflow.com/questions/50766908/mutate-variables-in-database-tables-directly-using-dplyr and it seems to work quite nicely – moodymudskipper Jun 17 '18 at 11:33
  • Ok, thank you Moody_Mudskipper! – Geet Jun 17 '18 at 16:32

1 Answers1

5

Here is an easy solution using pool and DBI since you can directly write and execute any valid SQL statement.

library(DBI)
library(pool)
library(RSQLite)

#Database
#Create a connection
pool <- dbPool(drv =RSQLite::SQLite(),dbname="")
#Colse open connection when it is no longer needed, to prevent leaks
poolClose(pool)


dbWriteTable(conn = pool, name = "mtcars", value = mtcars)

insert_new_column <- function(table_name, column_name, column_type){
        query <- sqlInterpolate(pool,"ALTER TABLE ?table ADD COLUMN ?column ?type;",table=table_name, column=column_name, type=column_type)

        dbExecute(pool,query)

}

insert_new_column(table_name="mtcars", column_name="test", column_type="REAL")  #More types: INTEGER, TEXT

dbGetQuery(pool,"Select * from mtcars;")

head(dbReadTable(pool, "mtcars"))
   mpg cyl disp  hp drat    wt  qsec vs am gear carb test
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  NA
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  NA
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  NA
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

Then you can do:

tbl(pool,"mtcars") %>% 
    mutate(test= cyl * 2)

Update

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "mtcars", mtcars)
insert_new_column2 <- function(table_name, column_name, column_type){
      dbWithTransaction(
            con,{
                   dbExecute(con, sqlInterpolate(con,"ALTER TABLE ?table ADD COLUMN ?column ?type;",table=table_name, column=column_name, type=column_type))

                   dbExecute(con, sqlInterpolate(con,"UPDATE ?table SET ?column = cyl*2;",table=table_name, column=column_name))

  }) 
 }
A. Suliman
  • 12,923
  • 5
  • 24
  • 37
  • When I close the connection and reconnect, the test shows NAs again. The mutate is not updating the values permanently. It gets updated with dbSendQuery(con, "UPDATE mtcars SET test=cyl*2"), but I need to do that through the function call. Can you help with that? – Geet Jun 17 '18 at 04:28
  • In fact, instead of UPDATE, can you try and make the effect of that mutate command permanent on the test variable? – Geet Jun 17 '18 at 04:39
  • I think `mutate` need an assignment to a new data frame as it does the calculation on the fly. Meanwhile, I add a new function for Update. – A. Suliman Jun 17 '18 at 05:41
  • Wow...This is AMAZING! Sorry for stretching this, but can the UPDATE include SQL equivalent of dplyr's "mtcars %>% group_by(carb, vs, am) %>% mutate(test= mean(disp, na.rm = T))" – Geet Jun 17 '18 at 05:58
  • Check [here](https://stackoverflow.com/questions/2480713/mysql-averaging-with-nulls), [here](https://stackoverflow.com/questions/17425011/get-avg-ignoring-null-or-zero-values) and [here](https://www.w3resource.com/sql/aggregate-functions/avg-function.php) – A. Suliman Jun 17 '18 at 06:07
  • 1
    tbl(pool,"mtcars") %>% group_by(carb, vs, am) %>% summarise(test= mean(disp, na.rm = T)) %>% show_query() helped! – Geet Jun 17 '18 at 06:29