
Background

I am having a problem with batch processing an R script (approx. 1000 lines) that requires calls to a SQL backend to pick up data. The data is pulled into R from SQL via odbc, and I have write privileges to the backend. I have stored as much as possible in the backend to reduce local memory usage.
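For reference, the connection and pull look roughly like this (a minimal sketch, not the real code; the DSN, table, and column names are placeholders):

library(DBI)
library(odbc)

# connect to the SQL backend over odbc (DSN name is a placeholder)
con <- DBI::dbConnect(odbc::odbc(), dsn = "my_backend")

# pull the time-series data for one batch of item groups
batch_data <- DBI::dbGetQuery(
  con,
  "SELECT * FROM item_timeseries WHERE item_group_id IN (1, 2, 3)"
)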

However, an issue is caused by a few memory-intensive processes, such as a huge number of correlations and imputations for time-series structures. When the script is applied to the entire dataset, these take up a lot of local memory, and that is before accounting for the memory needed just to pull the data from the database.

Script and data structures

My script is designed so that batches of 20 'item groups', with up to 100 items in each group, are used to get the time-series data for each item in the group. This data is stored as a nested dataframe relative to the 'item group' (see image below), and is used throughout the script to facilitate parallel processing on the selected batch.
[image: nested dataframe of time-series data, one row per 'item group']

Obviously, if I expand the nested table to allow 2000+ 'item groups', I will crash my computer just loading the data, before the memory-intensive processes are even applied. So I thought the best approach may be to create a grouping variable in the backend table (numbered 1 to 200) for each chunk/batch I want to call into the R script.
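Roughly, what I have in mind for that grouping variable is to build a small lookup table and write it back to the backend, something like this sketch (table and column names are placeholders; 20 item groups per chunk matches my current batches):

library(DBI)

# distinct item groups currently in the backend (names are placeholders)
item_groups <- DBI::dbGetQuery(con, "SELECT DISTINCT item_group_id FROM item_timeseries")

# assign every 20 item groups to one chunk: 1, 1, ..., 2, 2, ...
item_groups$chunk_id <- ((seq_len(nrow(item_groups)) - 1) %/% 20) + 1

# write the lookup back so the script can filter on chunk_id
DBI::dbWriteTable(con, "item_group_chunks", item_groups, overwrite = TRUE)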

The Question

I can also use the backend to work down the chunking/batch group variable, removing groups as they are processed. My problem is now how to make the script run until all the chunks have been processed, either directly in the R script or in the shell. What is the best approach for this? Are there any examples of working through a database like this?

I appreciate that this is more of an approach question than a coding one, but I would be happy with any coding examples that may be useful, as well as approach advice. I am new to productising R code, so anything would be helpful.

Thank you for reading.

FYI I do not have the luxury of using cloud computing at this time.

Comte
  • Why the [shell] tag? – r2evans Mar 18 '22 at 11:09
  • I think I understand your dilemma, but I don't know how we can help; we don't know the data or your code. You talk about "memory intensive processes": the fact that you load this frame with 2000 groups and 15-45k rows each suggests that you think you need all of them in R at a time. What is it about your code that means you must load all of that at once? – r2evans Mar 18 '22 at 11:15
  • Sorry, I may not have been explicit enough. I'm saying I cannot load 2000 groups, as it will crash, which is why I want to make the code process the data in batches/chunks. I do, however, need to process 2000+ groups. – Comte Mar 18 '22 at 11:23
  • Then perhaps just load 5-10 at a time, do each chunk, then upload/save the results, delete the variables, and lather/rinse/repeat. – r2evans Mar 18 '22 at 11:29
  • @r2evans I removed the shell tag; I thought it was relevant to calling the script multiple times. I'm not looking for a specific code example, more the best framework, and any example code that would help conceptualise that. For example, could I loop over the script while checking the backend? – Comte Mar 18 '22 at 11:29
  • A [mcve] (minimal reproducible example) is a great start when asking SQL questions. – jarlh Mar 18 '22 at 11:31
  • @r2evans I am thinking of keeping the chunks as 20 item groups, but that would be 100+ calls of the script, so it is not practical in production unless I can automate it (maybe by looping) using the chunk grouping variable in the database. – Comte Mar 18 '22 at 11:32
  • I'm really not certain how to help you, Comte. Can you execute your script on a computer/server with more RAM? Is your database SQL Server with R-Services enabled? Can any of your "memory intensive processes" be translated to raw SQL? We know _nothing_ here. I understand it's a hard problem, and it's hard to convey it well in a forum like this, but that does not make it any easier to answer knowing nothing about the data or your processes. – r2evans Mar 18 '22 at 11:38
  • Firstly, thank you for your responses. I certainly realise now this could have been framed much better, and I may need to rework the question. I expect to deploy the script later (server side), but right now I need to be able to run it locally (16gb ram). I did not realise you could enable SQL to use R, very useful information. Although I couldn't use that now; I need to call the data to be processed from SQL into R, and I think I want to create a temporary table that monitors the chunks. So I guess I somehow need to loop over the script. I think I may have found the answer. – Comte Mar 18 '22 at 12:18
  • I believe I want something like this: calling my script as a function, so that I can iterate over it and over the SQL table with the variables in it. I will update the question and answer afterwards. Thank you for your thoughts; I will certainly try to frame questions better in the future. https://stackoverflow.com/questions/41729469/r-script-as-a-function – Comte Mar 18 '22 at 12:21
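A rough sketch of the approach described in the last two comments: wrap the per-chunk work in a function and loop over the chunk ids held in the backend (the connection, table, column, and function names below are placeholders, not from the actual script):

library(DBI)

# process_chunk() stands in for the existing ~1000-line script body:
# pull one chunk's item groups, run the processing, write results back
process_chunk <- function(con, chunk_id) {
  sql <- paste0(
    "SELECT t.* FROM item_timeseries t ",
    "JOIN item_group_chunks c ON t.item_group_id = c.item_group_id ",
    "WHERE c.chunk_id = ", chunk_id
  )
  chunk_data <- DBI::dbGetQuery(con, sql)
  # ... existing processing of chunk_data goes here ...
  # results can be written back with DBI::dbWriteTable(con, "results", results, append = TRUE)
  invisible(NULL)
}

# loop until every chunk recorded in the backend has been processed
chunk_ids <- DBI::dbGetQuery(con, "SELECT DISTINCT chunk_id FROM item_group_chunks")$chunk_id
for (id in sort(chunk_ids)) {
  process_chunk(con, id)
  gc()  # release memory between chunks
}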

1 Answer


For iterating over large datasets, I use the modulus of a numeric column.

For example, the following code will subset to (approximately) one tenth of your dataset:

library(dplyr)  # tbl() on a DBI connection also needs the dbplyr package installed

# lazy reference to the remote table; nothing is pulled into R yet
large_remote_table = tbl(db_con, ...)

# keep only ~1/10th of the rows; the filter is translated to SQL and runs on the backend
subset_remote_table = large_remote_table %>%
  filter(numeric_column %% 10 == 0)

What should you use as a numeric column? If your dataset already has some kind of numeric identifier, then that is a good choice. Otherwise, I recommend you modify your input dataset to include a row number column. Many SQL flavors have a built-in row number function.
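If you would rather not alter the backend table, dbplyr can usually derive a row number on the fly by translating dplyr's row_number() into the backend's ROW_NUMBER() window function (a sketch, assuming the backend supports window functions and some id_column exists to order by):

library(dplyr)
library(dbplyr)

large_remote_table = tbl(db_con, ...) %>%
  # translated to ROW_NUMBER() OVER (ORDER BY id_column) on the backend
  mutate(row_number = row_number(id_column))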

We can then iterate over all the subsets using a for loop:

number_subsets = 2000

large_remote_table = tbl(db_con, ...)

output_list = list()

for(ii in 1:number_subsets){
  # row_number %% number_subsets takes values 0 to (number_subsets - 1),
  # so shift by one to cover every subset exactly once
  this_subset = large_remote_table %>%
    filter(row_number %% number_subsets == ii - 1)

  # further processing of subset
  # (this_subset is a lazy remote query until it is collect()-ed)

  # store result
  output_list[[ii]] = results_from_processing_of_subset
}
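If holding everything in output_list is itself too much for local memory, each chunk's results can instead be written back to the database as they are produced and then dropped, for example (a sketch; the results table name is a placeholder and this assumes the results are a data frame):

  # inside the loop, instead of accumulating results locally:
  DBI::dbWriteTable(db_con, "chunk_results", results_from_processing_of_subset, append = TRUE)
  rm(results_from_processing_of_subset)
  gc()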
Simon.S.A.