
I have this data set:

var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)

sample_data = data.frame(var_1, var_2, var_3)

I would like to split this data set into 10 smaller data sets (each containing 100 rows) and then upload them onto a server.

I know how to do this by hand:

sample_1 = sample_data[1:100,]
sample_2 = sample_data[101:200,]
sample_3 = sample_data[201:300,]

# etc.

library(DBI)

#establish connection (my_connection)

dbWriteTable(my_connection,  SQL("sample_1"), sample_1)
dbWriteTable(my_connection,  SQL("sample_2"), sample_2)
dbWriteTable(my_connection,  SQL("sample_3"), sample_3)

# etc

Is there a way to do this "quicker"?

I thought of a general way to do this, but I am not sure how to write the code correctly:

i = seq(1, 1000, by = 100)
j = i + 99

for (...) {
sample_i = sample_data[i:j, ]

dbWriteTable(my_connection, SQL("sample_i"), sample_i)
}
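One working form of this loop builds each table name with paste0() instead of the literal string "sample_i". This is a sketch, not the final setup: an in-memory SQLite database stands in here for the real server connection.

```r
library(DBI)

# An in-memory SQLite database stands in for the real server connection
my_connection <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

sample_data <- data.frame(var_1 = rnorm(1000, 1000, 1000),
                          var_2 = rnorm(1000, 1000, 1000),
                          var_3 = rnorm(1000, 1000, 1000))

starts <- seq(1, nrow(sample_data), by = 100)   # 1, 101, 201, ...
for (k in seq_along(starts)) {
  chunk <- sample_data[starts[k]:(starts[k] + 99), ]
  # paste0() builds the table name "sample_1", "sample_2", ... at run time
  dbWriteTable(my_connection, paste0("sample_", k), chunk)
}

tbls <- dbListTables(my_connection)   # ten tables, sample_1 through sample_10
dbDisconnect(my_connection)
```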

Can someone please help me with this?

Thank you!

stats_noob
    https://stackoverflow.com/questions/14164525/splitting-a-large-data-frame-into-smaller-segments – rdelrossi Mar 09 '22 at 23:52
  • Thank you! I saw a similar command - but from here, how do you save each of these files individually (e.g. sample_1, sample_2, etc)? In the link you posted, the final output is a "list". Also, is it possible to bulk upload them at the same time? – stats_noob Mar 09 '22 at 23:57
  • Sure, you can iterate over the list calling `dbWriteTable()` on each element. There's an example of using `map2` from the `purrr` package over at https://stackoverflow.com/questions/56604971/there-a-way-in-r-to-implement-the-apply-function-with-dbwritetable-in-dbi-librar – rdelrossi Mar 09 '22 at 23:57
  • Thank you! I was just wondering, why do you need both "list_of_names" and "list_of_dataframes"? Couldn't you just use "list_of_names"? map2(list_of_names, list_of_dataframes, function(x,y) dbWriteTable(conn_R, x, y)) – stats_noob Mar 10 '22 at 00:03
  • Sorry, not following. Your pseudo code here, though, seems logical. `map2` iterates over two inputs simultaneously. One input can be the list of data table names and the other can be the list of data frames (that is, the output of your earlier split). With `split()` and `map2()` you should have what you need to accomplish your goal. – rdelrossi Mar 10 '22 at 00:10
  • I put a more complete example together as an answer. Perhaps that will help. – rdelrossi Mar 10 '22 at 00:20

1 Answer


Here's an example using the SQLite database engine. We'll start with your sample data set:

var_1 = rnorm(1000,1000,1000)
var_2 = rnorm(1000,1000,1000)
var_3 = rnorm(1000,1000,1000)

sample_data = data.frame(var_1, var_2, var_3)

Now we'll break your large data frame into a list of 10 data frames, each with 100 rows, using split():

list_of_dfs <- split(
  sample_data, (seq(nrow(sample_data))-1) %/% 100
)
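As a quick sanity check on the same data: the `%/% 100` integer division maps rows 1–100 to group 0, rows 101–200 to group 1, and so on, so the result is a list of 10 data frames of 100 rows each.

```r
sample_data <- data.frame(var_1 = rnorm(1000, 1000, 1000),
                          var_2 = rnorm(1000, 1000, 1000),
                          var_3 = rnorm(1000, 1000, 1000))

# Each row's group is (row index - 1) %/% 100, giving groups 0, 1, ..., 9
list_of_dfs <- split(sample_data, (seq(nrow(sample_data)) - 1) %/% 100)

length(list_of_dfs)          # 10 chunks
sapply(list_of_dfs, nrow)    # each chunk holds 100 rows
names(list_of_dfs)           # "0" "1" ... "9"
```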

We'll create a vector with the names of the tables in the database. Here, I'm just making a simple vector with the names sample_1, sample_2, etc.

table_names <- paste0("sample_", 1:10)

Now we're ready to write to the database. We'll make a connection and then use map2() from the purrr package to iterate over the list of data frames and the vector of table names simultaneously, calling dbWriteTable() each time:

library(DBI)
library(purrr)

connection <- dbConnect(RSQLite::SQLite(), dbname = "test.db")
map2(
  table_names,
  list_of_dfs,
  function(x, y) dbWriteTable(connection, x, y)
)
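Put together, the full flow looks like the sketch below. An in-memory SQLite database is used here so the example is self-contained, and dbReadTable() reads each table back to confirm the round trip.

```r
library(DBI)
library(purrr)

sample_data <- data.frame(var_1 = rnorm(1000, 1000, 1000),
                          var_2 = rnorm(1000, 1000, 1000),
                          var_3 = rnorm(1000, 1000, 1000))

# Split into 10 chunks of 100 rows and name the target tables
list_of_dfs <- split(sample_data, (seq(nrow(sample_data)) - 1) %/% 100)
table_names <- paste0("sample_", 1:10)

connection <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
map2(table_names, list_of_dfs, function(x, y) dbWriteTable(connection, x, y))

# Verify: all ten tables exist and each round-trips with 100 rows
tbls <- dbListTables(connection)
counts <- sapply(table_names, function(t) nrow(dbReadTable(connection, t)))
dbDisconnect(connection)
```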
rdelrossi
  • Thank you so much! Was there any reason that you decided to use the RSQLite library? Is it possible to upload the files using "dbWriteTable()" instead of "dbConnect()"? Thank you so much for all your help! – stats_noob Mar 10 '22 at 00:24
  • I used RSQLite because it was the quickest way for me to demonstrate the flow of activities. You can use whatever interface is appropriate for your database. At least with RSQLite, it's necessary to establish a connection to the database (that's what `connection` is) and then use that handle in subsequent calls to `dbWriteTable()` or other database functions. – rdelrossi Mar 10 '22 at 00:26
  • dbWriteTable(my_connection, map2( table_names, list_of_dfs, function(x,y) dbWriteTable(connection, x, y)) .... should this work? Thank you! – stats_noob Mar 10 '22 at 00:27
  • No. Follow the flow above. Construct a list of data frames with `split()` and a vector of table names. When you call `map2()` it will call `dbWriteTable()` for every item in the list and in the vector. In this example, that one call to `map2()` will result in 10 calls to `dbWriteTable()`. – rdelrossi Mar 10 '22 at 00:31
  • Oh, I think I understand! This line just establishes the connection: connection <- dbConnect(RSQLite::SQLite(), dbname = "test.db") .... I don't have to run this line. I can just keep my pre-existing connection – stats_noob Mar 10 '22 at 00:33
  • So all I have to do is just run the following lines once all the mini tables have been created: map2( table_names, list_of_dfs, function(x,y) dbWriteTable(connection, x, y) ) – stats_noob Mar 10 '22 at 00:33
  • I think you have it! Good luck. – rdelrossi Mar 10 '22 at 00:34
  • Thank you so much! BTW this question that I posted is actually in reference to another problem I am currently facing (https://stackoverflow.com/questions/71417489/tricking-an-sql-server-to-accept-a-file-from-r). I thought of this idea (that you showed me how to do) as a diagnostic tool to attempt to solve this other problem. Once again, thank you so much! – stats_noob Mar 10 '22 at 00:35