
I have a question similar to this Stack Overflow post.

How can I create a persistent table from a SQL query in a database (I use a DB2 database)? My goal is to use a table from one schema and to permanently create a more or less modified table in another schema.

What works so far is to pull the data to R and subsequently create a table in a different schema:

dplyr::tbl(con, in_schema("SCHEMA_A", "TABLE")) %>%
  collect() %>%
  DBI::dbWriteTable(con, Id(schema = "SCHEMA_B", table = "NEW_TABLE"), ., overwrite = TRUE)

However, I'd like to incorporate the compute() function in a dplyr pipeline so that I do not have to pull the data into R; that is, I'd like to keep the data in the database. As a side note: I do not know how I would replace DBI's dbWriteTable() with dplyr's copy_to() – being able to do that would also help me.
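
My best guess at the copy_to() variant of the collect()-based approach above would be something like the sketch below, but I have not verified it (it assumes a dbplyr version that accepts in_schema() as the name argument, which may well be part of my problem):

# unverified sketch: same collect()-based approach, but writing with copy_to()
dplyr::tbl(con, in_schema("SCHEMA_A", "TABLE")) %>%
  collect() %>%
  dplyr::copy_to(con, .,
                 name = in_schema("SCHEMA_B", "NEW_TABLE"),
                 temporary = FALSE,
                 overwrite = TRUE)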

Unfortunately, I am not able to make it work, even after reading ?compute() and its online documentation. The following code framework does not work and results in an error:

dplyr::tbl(con, in_schema("SCHEMA_A", "TABLE")) %>%
  dplyr::compute(in_schema("SCHEMA_B", "NEW_TABLE"), analyze = FALSE, temporary = FALSE)

Is there a solution for using compute() or some other solution applicable to a dplyr pipeline?

MatSchu
  • the compute you are trying to use is a function from `dbplyr`, not from `dplyr` – Waldi Jun 19 '20 at 09:58
  • Ah ok, when I type `?compute()` the dplyr documentation pops up. – MatSchu Jun 19 '20 at 10:07
  • see the [online documentation](https://dbplyr.tidyverse.org/dev/reference/collapse.tbl_sql.html) link you sent. dbplyr will try to reformulate your dplyr query into a SQL query executed on DB server. – Waldi Jun 19 '20 at 10:27
  • I changed `dplyr::compute(in_schema("SCHEMA_B", "NEW_TABLE"), analyze = FALSE, temporary = FALSE)` to `dbplyr::compute(in_schema("SCHEMA_B", "NEW_TABLE"), analyze = FALSE, temporary = FALSE)` (i. e., `dplyr::` to `dbplyr::`). The following error occurs: `Error: 'compute' is not an exported object from 'namespace:dbplyr'`. – MatSchu Jun 19 '20 at 10:53

2 Answers


I use a custom function that takes the SQL query behind a remote table, converts it into a query that can be executed on the SQL server to save a new table, and then executes that query using the DBI package. Key details are below; full details (and other functions I find useful) are in my GitHub repository here.

write_to_database <- function(input_tbl, db_connection, db, schema, tbl_name){
  # SQL query
  sql_query <- glue::glue("SELECT *\n",
                          "INTO {db}.{schema}.{tbl_name}\n",
                          "FROM (\n",
                          dbplyr::sql_render(input_tbl),
                          "\n) AS from_table")
  
  # run query
  DBI::dbExecute(db_connection, as.character(sql_query))
}

The essence of the idea is to construct an SQL query that, if executed directly in your database language, would give you the desired outcome. In my application this takes the form:

SELECT *
INTO db.schema.table
FROM (
  /* sub query for existing table */
) AS alias

Note that this is using SQL Server, and your particular SQL syntax might be different. INTO is the SQL Server pattern for writing a table. In the example linked to in the question, the syntax is TO TABLE.
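
For illustration, here is a sketch of how the function slots into a pipeline, using made-up schema, table, and column names (assuming an open db_connection):

# purely illustrative names; any lazy dplyr manipulation works before the write
dplyr::tbl(db_connection, dbplyr::in_schema("SCHEMA_A", "TABLE")) %>%
  dplyr::filter(YEAR >= 2019) %>%
  write_to_database(db_connection, "my_db", "SCHEMA_B", "NEW_TABLE")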

Simon.S.A.
  • Can you incorporate your custom function in a dplyr pipeline such that it appears at the end of the pipeline? – MatSchu Jun 23 '20 at 04:46
  • Of course, the pipe `%>%` just replaces the first argument: `x %>% function(y)` is the same as `function(x,y)`. So `remote_table %>% some_manipulations %>% write_to_database(db_con, db, schema, tbl_name)` will work. – Simon.S.A. Jun 23 '20 at 20:35
  • Cool, I will try to implement your approach soon. Today, I could manage to make `DBI::dbExecute()` work with DB2's SQL syntax, which seems to be different from SQL server syntax. So far, I have accessed the database only via R's `dplyr` and `dbplyr` or SAS' `proc sql`. I will post some code here once I'm (hopefully) done. – MatSchu Jun 24 '20 at 18:17

Thanks to @Simon.S.A., I was able to solve my problem. As he showed in his answer, one can define a custom function and incorporate it into a dplyr pipeline. My adapted code looks like this:

# Custom function

write_to_database <- function(input_tbl, db_connection, schema, tbl_name){

  # SQL query: wrap the rendered dplyr query in CREATE TABLE ... AS

  sql_query <- glue::glue("CREATE TABLE {schema}.{tbl_name} AS (\n",
                          "SELECT * FROM (\n",
                          dbplyr::sql_render(input_tbl),
                          "\n)) WITH DATA;")

  # Drop the table if it already exists

  DBI::dbExecute(db_connection, glue::glue("BEGIN\n",
                                           "IF EXISTS\n",
                                           "(SELECT TABNAME FROM SYSCAT.TABLES WHERE TABSCHEMA = '{schema}' AND TABNAME = '{tbl_name}') THEN\n",
                                           "PREPARE stmt FROM 'DROP TABLE {schema}.{tbl_name}';\n",
                                           "EXECUTE stmt;\n",
                                           "END IF;\n",
                                           "END"))

  # Run query

  DBI::dbExecute(db_connection, as.character(sql_query))
}

# Dplyr pipeline

dplyr::tbl(con, in_schema("SCHEMA_A", "SOURCE_TABLE_NAME")) %>%
  dplyr::filter(VARIABLE == "ABC") %>% 
  show_query() %>% 
  write_to_database(., con, "SCHEMA_B", "NEW_TABLE_NAME")

It turns out that DB2 does not appear to support DROP TABLE IF EXISTS, so some additional programming is necessary. I used this Stack Overflow post to get it done. Furthermore, in my case I do not need to specify the database explicitly, so the db parameter from the custom function is left out.
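
As an untested aside, the whole BEGIN ... END block inside the custom function could perhaps be replaced by DBI's own existence check, assuming the DB2 driver supports DBI::Id() with a schema component:

# untested sketch of an alternative to the BEGIN ... END block above;
# assumes the driver handles DBI::Id() with a schema component
tbl_id <- DBI::Id(schema = schema, table = tbl_name)
if (DBI::dbExistsTable(db_connection, tbl_id)) {
  DBI::dbRemoveTable(db_connection, tbl_id)
}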

MatSchu