
I am trying to use this as a learning exercise. I have a dataset of 343,345 rows and need to geocode the data in ArcGIS; however, the full dataset is too large to run properly. I need to break up the data into chunks of 50,000 rows, with the last chunk being a bit smaller.

Currently, I am doing this manually, in individual chunks of 50,000, as seen here:

gen_gis_test_1_50 = gen_gis[c(1:50000),]
gen_gis_test_51_100 = gen_gis[c(50001:100000),]

Then I use write.csv() for each new variable I created.

I've done many of these rote exercises and wanted to see what the best method for writing a function would be. Ideally the function would read the main file, then spit out a new variable for every 50,000 rows. Then I'd like to write a second function that goes through each of those and writes it as a CSV.
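A minimal sketch of the kind of combined helper described above, assuming the data frame is stored in a variable like gen_gis (the function name write_chunks and the file prefix are placeholders, not anything from the original post):

```r
# Sketch: split a data frame into fixed-size chunks and write each to a CSV.
write_chunks <- function(df, chunk_size = 50000, prefix = "gen_gis") {
  # First row index of each chunk: 1, 50001, 100001, ...
  starts <- seq(1, nrow(df), by = chunk_size)
  for (start in starts) {
    # The last chunk simply ends at the final row.
    end <- min(start + chunk_size - 1, nrow(df))
    write.csv(df[start:end, ],
              sprintf("%s_%d_%d.csv", prefix, start, end),
              row.names = FALSE)
  }
}
```

For 343,345 rows and chunk_size = 50000 this would produce seven files, with the last one covering rows 300,001 to 343,345.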

Many thanks in advance!

Deeba Yavrom
  • First part of your question seems a duplicate of: https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows – iago Jul 23 '20 at 11:52

2 Answers


Take the built-in dataset iris for example, which has 150 rows. Say we want to break up the data into chunks of 40 rows. Since 150 = 3 × 40 + 30, the last chunk will be iris[121:150, ]. Then with the following calculation:

df <- iris
n <- 40
grp <- (1:nrow(df)-1) %/% n

table(grp)
# grp
#  0  1  2  3 
# 40 40 40 30 

You get 4 groups. Each group has 40 rows except the last, which has 30. Finally, use split() to divide the data and Map() to export each chunk.

df_split <- split(df, grp)
Map(function(x, y) {
  write.csv(x, sprintf("gen_gis_%d_%d.csv", y, y + nrow(x) - 1), row.names = FALSE)
}, df_split, (seq_along(df_split) - 1) * n + 1)

Check your working directory for the exported files.

Darren Tsai

If you are okay with using the tidyverse, there are some neat functions that do this in a short and easy way. You first create a grouping variable (integer-divide the zero-based row numbers by 50,000). Next you group the dataset using dplyr's group_by() function, before writing the data to CSV files using the group_map() function, also from the dplyr package.

Using Darren Tsai's example (splitting the iris dataset into chunks of 40 rows), this would look like this:

library(dplyr)

iris %>%
  mutate(grp = (row_number() - 1) %/% 40) %>% # integer-divide the zero-based row number by 40
  group_by(grp) %>%
  group_map(~ write.csv(.x, file = paste0("dataset_split_", .y, ".csv")))

In the group_map() call, the ~ defines the function that will be applied to each group of your data frame. .x represents the (split) datasets (which are now at most 40 rows long) and .y represents the grouping value (in this case 0 to 3), so that each group gets a different filename. The resulting files would be named dataset_split_0.csv, dataset_split_1.csv, dataset_split_2.csv and dataset_split_3.csv. In your case, just replace iris with gen_gis and 40 with 50000 and it should work.
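Since write.csv() is called purely for its side effect here, dplyr's group_walk() (a sibling of group_map() that discards return values) may read slightly more clearly; a sketch of the same split using it (filenames are illustrative):

```r
library(dplyr)

iris %>%
  mutate(grp = (row_number() - 1) %/% 40) %>%  # zero-based chunk index
  group_by(grp) %>%
  # .y is a one-row tibble of the group keys, so .y$grp gives the index
  group_walk(~ write.csv(.x, paste0("dataset_split_", .y$grp, ".csv"),
                         row.names = FALSE))
```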

FloSchmo
  • thanks, so I tried your example but I get the error "Error in group_by(grp) : object 'grp' not found". I did replace iris and 40 with my specific values too. The error comes after I run the "group_by(grp)..." code – Deeba Yavrom Jul 23 '20 at 16:34
    It's more thoughtful to use `mutate(grp = (row_number() - 1) %/% 40)`. You can check the difference between `table((1:150) %/% 40)` and `table((1:150 - 1) %/% 40)`. – Darren Tsai Jul 23 '20 at 17:52
  • @DeebaYavrom Oh, right, I forgot a `%>%` after the `group_by`, which is needed so the whole code chunk gets executed as one chain of functions. – FloSchmo Jul 24 '20 at 06:59