-1

I am working with the R programming language. I have the following table that is located on a server:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

head(testframe)

  age height height2 gender gender2
1  18   76.1    76.1      M       M
2  19   77.0    77.0      F       F
3  20   78.1    78.1      M       M
4  21   78.2    78.2      M       M
5  22   78.8    78.8      F       F
6  23   79.7    79.7      F       F

My goal is to remove columns from the above tables that are identical, even if they have different names. Thus, the end product would look like this:

  age height gender
1  18   76.1      M
2  19   77.0      F
3  20   78.1      M
4  21   78.2      M
5  22   78.8      F
6  23   79.7      F

My Problem: I want to remove these duplicate columns from the table directly on the server. I have been reading about the "dbplyr" library in R, and I am not sure if this is able to do this. I tried the two following methods:

Method 1:

library(RODBC)
library(sqldf)
library(dplyr)
library(dbplyr)
library(odbc)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, testframe)



final_1 = testframe %>%
  dplyr::select(., which(as.list(.) %>%
                           duplicated %>%
                           `!`))

Method 2:

mtcars2 <- tbl(con, "testframe")


final_2 = mtcars2 %>%
    dplyr::select(., which(as.list(.) %>%
                               duplicated %>%
                               `!`))

In the above examples, the data ("testframe") is in my local environment, but I tried to use the copy_con statement in an attempt to see if these statements would work if the data was on a server.

My Question: The code from the above examples seemed to have worked - but I am not sure if the above code can only perform these statements on data in the local environment - or it can also perform these same statements on data located on a server. Can someone please comment on this?

Thanks!

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 2
    If I'm understanding correctly, you cannot use the `dplyr` verbs (as part of `dbplyr`) to change your remote database data. I believe this is by design, keep the source immutable. Other options would be to manipulate the data as you have done and then save the resulting data.frame using `copy_to` or `dbWriteTable`, or using different custom SQL statements such as `ALTER` or `UPDATE` (depending on what your exact needs are), or something like that. – Ben Dec 12 '21 at 14:17
  • 2
    Also, as you have in your other example, you can review the SQL statements with `show_query` and see what is being done. The `SELECT` with `AS` and `FROM` won't change any database data, but only retrieve column data with aliases. Which is what it would be designed to do. – Ben Dec 12 '21 at 14:19
  • @ Ben: thank you so much for your reply! This makes sense - you can not directly overwrite the original table - but you can make a copy of the table, and then use "dplyr verbs" to manipulate this table directly? – stats_noob Dec 13 '21 at 15:19
  • https://stackoverflow.com/questions/70231544/re-writing-fuzzy-join-functions-from-r-to-sql I wonder if there is a way to do "fuzzy joins" in R so that they can be run the same way? – stats_noob Dec 13 '21 at 15:20

1 Answers1

1

dbplyr works by translating dplyr commands into the corresponding database language, before passing these commands to the database. I strongly suspect it will struggle to translate which(as.list(.) %>% duplicated %>% '!') as you are asking it to translate the meaning of these instructions, not a literal command-for-command translation.

You can test dbplyr translation using simulated database connections. For example:

library(dplyr)
library(dbplyr)

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

remote_table = tbl_lazy(testframe , con = simulate_mssql()) # the simulation

remote_table %>%
  mutate(height3 = height2) %>%
  select(height, height2, height3) %>%
  show_query()

If this produces valid database code then you know that your dplyr commands are working as intended. But if this produces unexpected database commands (or R commands appear in the translation) then dbplyr has not been able to translate your instructions into the database language.

For this type of problem I might take an approach like the following:

all_column_names = colnames(remotetable)
num_cols = length(all_column_names)
is_dupe = rep(0, num_cols)

for(ii in 1:(num_cols - 1)){
  for(jj in 2:num_cols) {
    colA = all_column_names[ii]
    colB = all_column_names[jj]

    this_check = remotetable %>%
      mutate(compare = ifelse(!!sym(colA) == !!sym(colB), 1, 0)) %>%
      ungroup() %>%
      summarise(num = n(), matches = sum(compare)) %>%
      collect()

    if(this_check$num == this_check$matches)
      is_dupe[jj] = 1
  }
}

The idea is to iterate through all pairs of columns, and using basic dplyr commands check if every value is equal. If the number of equal values is equal to the total number of values then match the column as a duplicate. You can then use the output to select the non-duplicate columns.

Notes

  • !!sym(colA) takes the text stored inside that variable colA and turns it into a column name that dbplyr can handle.
  • collect() is the command to pull data from the database into local R memory. Only do this once you are confident that the data will fit in local memory.
Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • @ Simon S.A. : thank you so much for your answer! In the second chunk of code you have written - when you launch "this_check", is "this_check" being run over the server? You mentioned that the "collect()" command will pull the data into local R memory. Is there a way to change this so that "this_check" can be created as a table on the database ... and once you are satisfied with the results, then you can use the collect() statement? For this, do you simply remove the "collect()" statement... and then "check_this" will be created as a new table on the server? Thank you so much! – stats_noob Dec 13 '21 at 15:27
  • 1
    Database tables are not editable in the same way that tables in R memory are. One option is to write an entire new database table (see [this](https://stackoverflow.com/a/67577301/7742981) answer). But this is not essential, you can interact with `this_check` before calling `collect()` and it will behave like a table (the output of a database query is a table even if it is produced dynamically rather than saved to disk). – Simon.S.A. Dec 13 '21 at 20:22
  • @ Simon S.A. : Thank you so much for your reply! I am working on another related question (bountied) - could you please take a look at it if you have time? https://stackoverflow.com/questions/70231544/re-writing-fuzzy-join-functions-from-r-to-sql Thank you so much for all your help! – stats_noob Dec 13 '21 at 20:42