
What is the best approach to combining multiple MySQL tables in R? For instance, I need to rbind 14 large MySQL tables (each >100k rows by 100 columns). I tried the approach below, which consumed most of my memory and timed out in MySQL. Is there an alternative solution? I do not need to fetch the whole table; I just need to group the combined data by a couple of variables and calculate some metrics.

station_tbl_t <- dbSendQuery(my_db, "select * from tbl_r3_300ft
                  union all
                  select * from tbl_r4_350ft
                  union all
                  select * from tbl_r5_400ft
                  union all
                  select * from tbl_r6_500ft
                  union all
                  select * from tbl_r7_600ft
                  union all
                  select * from tbl_r8_700ft
                  union all
                  select * from tbl_r9_800ft
                  union all
                  select * from tbl_r10_900ft
                  union all
                  select * from tbl_r11_1000ft
                  union all
                  select * from tbl_r12_1200ft
                  union all
                  select * from tbl_r13_1400ft
                  union all
                  select * from tbl_r14_1600ft
                  union all
                  select * from tbl_r15_1800ft
                  union all
                  select * from tbl_r16_2000ft
                  ")
TTT
  • Use a where clause on each select to limit the rows returned, and a group by if you need to summarize the data, along with aggregate functions such as SUM or MAX, to combine rows in the results (see the sketch after these comments). – Sloan Thrasher Jul 04 '17 at 23:24
  • @SloanThrasher, thanks for the suggestion. So this has to be done through `dbSendQuery`, with no `dplyr` function, right? – TTT Jul 04 '17 at 23:25
  • This could be done with dplyr/dbplyr (see the recent RStudio [blog post](https://blog.rstudio.org/2017/06/27/dbplyr-1-1-0/) for a starting point). But I'd keep using your approach of building the smaller desired dataset through SQL (after incorporating @SloanThrasher's advice). Don't forget to limit columns by listing them explicitly (instead of using the `*`). – wibeasley Jul 04 '17 at 23:33
  • @wibeasley, thanks for the tip! – TTT Jul 04 '17 at 23:36
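
As a rough sketch of the advice in these comments, the grouping and aggregation can be pushed into MySQL so only the summarized rows ever reach R. The column names here (`grp1`, `grp2`, `val`) are hypothetical placeholders for the real grouping variables and metric column; `my_db` is the open connection from the question.

    library(DBI)

    tbls <- c("tbl_r3_300ft", "tbl_r4_350ft", "tbl_r5_400ft")  # ...and the remaining tables

    # Stack only the needed columns, then group across all tables at once.
    # grp1, grp2, and val are placeholder column names.
    inner <- paste(sprintf("SELECT grp1, grp2, val FROM %s", tbls),
                   collapse = " UNION ALL ")

    sql <- sprintf(
      "SELECT grp1, grp2, COUNT(*) AS n, AVG(val) AS avg_val
         FROM (%s) AS combined
        GROUP BY grp1, grp2", inner)

    summary_df <- dbGetQuery(my_db, sql)

Because the server returns only one row per group rather than the >1.4 million raw rows, this sidesteps both the memory pressure and the MySQL timeout.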

1 Answer


Consider iteratively importing the MySQL table data and then row binding in R. Be sure to select only the needed columns to save on overhead:

tbls <- c("tbl_r3_300ft", "tbl_r4_350ft", "tbl_r5_400ft", 
          "tbl_r6_500ft", "tbl_r7_600ft", "tbl_r8_700ft", 
          "tbl_r9_800ft", "tbl_r10_900ft", "tbl_r11_1000ft", 
          "tbl_r12_1200ft", "tbl_r13_1400ft", "tbl_r14_1600ft", 
          "tbl_r15_1800ft", "tbl_r16_2000ft")

sql <- "SELECT Col1, Col2, Col3 FROM" 

dfList <- lapply(paste(sql, tbls), function(s) {
  # Return the error message as a string if a query fails
  tryCatch(dbGetQuery(my_db, s),
           error = function(e) as.character(e))
})

# ROW BIND VERSIONS ACROSS PACKAGES
master_df <- base::do.call(rbind, dfList)
master_df <- plyr::rbind.fill(dfList)
master_df <- dplyr::bind_rows(dfList)
master_df <- data.table::rbindlist(dfList)
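
Since the `tryCatch()` returns the error message as a character string when a query fails, it can help to flag and drop failed queries before row binding, as the comments below suggest. A small optional check:

    # Label results by source table, then drop any queries that errored out
    names(dfList) <- tbls

    failed <- vapply(dfList, is.character, logical(1))
    if (any(failed)) {
      warning("Queries failed for: ", paste(tbls[failed], collapse = ", "))
      dfList <- dfList[!failed]
    }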
Parfait
  • Got this error from the `lapply` statement: `Error in .local(conn, statement, ...) : connection with pending rows, close resultSet before continuing` – TTT Jul 05 '17 at 18:06
  • Whoops! I forgot a closing parenthesis. Try again. We may need to use a `tryCatch()`. – Parfait Jul 05 '17 at 18:27
  • Thanks for the suggestion. Still got the same error; maybe it is due to `lapply`? Do I need to keep checking the connection in the loop? – TTT Jul 05 '17 at 19:13
  • Ah-ha! It is actually `dbSendQuery()` that requires a `fetch()` afterwards. See the edit with `dbGetQuery()` (which does [all](https://stackoverflow.com/a/21167526/1422451) three steps in one call), assuming you use RMySQL; the difference is sketched after these comments. I also added a `tryCatch()`. Before you row bind, inspect *dfList* to see whether all tables came through. – Parfait Jul 05 '17 at 19:41
  • Awesome! So all tables bound into a >1 million row data frame? And in a reasonable amount of time? Curious, did you use `do.call()`? – Parfait Jul 05 '17 at 20:51
  • @Parfait, it took a while (~40 mins) to get the columns, and it goes beyond my memory limit if I want to get 4. But the approach works. – TTT Jul 05 '17 at 22:33
  • I just helped another R user with this. If you have a 64-bit machine, you can increase memory with `memory.limit(size=56000)`; expand as needed. Additionally, close other apps and try the command-line Rscript over RStudio. Also try the row bind with the other packages, which may be faster. – Parfait Jul 06 '17 at 00:37
  • I thought `memory.limit(size=56000)` will not go beyond the physical memory, right? I am using Rterm. – TTT Jul 06 '17 at 02:29
  • Adjust as needed. You will be warned if the memory request cannot be processed. – Parfait Jul 06 '17 at 02:48
  • Thanks, and BTW I am using MySQL here because of memory limitations. – TTT Jul 06 '17 at 02:51
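
For reference, the "connection with pending rows" error discussed above arises because `dbSendQuery()` leaves an open result set that must be fetched and cleared before the next query. A minimal sketch of the two patterns, using DBI calls:

    # Manual pattern required by dbSendQuery(): send, fetch, clear
    res <- dbSendQuery(my_db, "SELECT Col1, Col2, Col3 FROM tbl_r3_300ft")
    df  <- dbFetch(res)    # older RMySQL code may use fetch(res, n = -1)
    dbClearResult(res)     # without this, the next query hits "pending rows"

    # dbGetQuery() wraps all three steps in one call
    df <- dbGetQuery(my_db, "SELECT Col1, Col2, Col3 FROM tbl_r3_300ft")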