Dividing one dataframe into many with names in R

Question

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.

What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.

Once I do the work, I will start to recombine them (using rbind()) one pair at a time.

I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).

Any ideas?

I would suggest using `readLines` with an open file connection to read a few rows at a time, process them as desired, and save the output sequentially. See here: https://stackoverflow.com/questions/12626637/read-a-text-file-in-r-line-by-line — jdobres, Jan 16 '22 at 03:04
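A minimal sketch of the approach in the comment above, assuming the data lives in a comma-separated text file with a header row. The temporary file, the column names `A` and `B`, the chunk size, and the running sum are all made up for illustration; the real per-chunk work would go where the sum is.

```r
# Hypothetical practice file: 25 rows, a letter column and an integer column
path <- tempfile(fileext = ".csv")
set.seed(1)
writeLines(
  c("A,B", paste(sample(letters, 25, replace = TRUE), 1:25, sep = ",")),
  path
)

con <- file(path, open = "r")                    # open once, read incrementally
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header

chunk_size <- 10   # rows per chunk (assumption; tune to available memory)
total <- 0         # stand-in for whatever result is accumulated

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                  # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total <- total + sum(chunk$B)                  # process this chunk only
}
close(con)
```

Only one chunk is held in memory at a time, so the full data frame never has to be loaded.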

www · Accepted Answer · 2022-01-16T04:28:11.063

2

You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a type that the split function can convert to a factor.

# Number of groups
N <- 20

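# Assign rows to groups 0..N-1 in round-robin order
dat$group <- seq_len(nrow(dat)) %% N

# Shift to 1..N so the groups line up with the names set below
dat$group <- dat$group + 1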

# Split dat by group (the formula form of f needs R >= 4.1.0; f = dat$group also works)
dat_list <- split(dat, f = ~group)

# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)

Data

set.seed(123)

# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
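For the sequential, by-row split mentioned in the question, a minimal sketch along the same lines follows. It assumes the data fits in memory; `dat_small`, `n_parts`, `part_id`, and `parts` are names made up for this illustration, and the 103-row data frame is only a stand-in for the real one.

```r
# Small stand-in for the large data frame (hypothetical data)
set.seed(1)
dat_small <- data.frame(
  A = sample(letters, size = 103, replace = TRUE),
  B = rpois(103, lambda = 1)
)

n_parts <- 20                 # number of pieces; change per case
n <- nrow(dat_small)

# Contiguous block id per row: the first rows get 1, the last rows get n_parts
part_id <- ceiling(seq_len(n) * n_parts / n)

# split() accepts a plain vector here; no extra column is needed
parts <- split(dat_small, part_id)
names(parts) <- paste0("newdf_", seq_along(parts))

# ... work on parts[["newdf_1"]], parts[["newdf_2"]], and so on ...

# Recombine everything in one call instead of pairwise rbind()
recombined <- do.call(rbind, unname(parts))
rownames(recombined) <- NULL
```

Each piece is then available as `parts$newdf_1`, `parts[["newdf_2"]]`, and so on. If separate objects are really wanted, `list2env(parts, envir = globalenv())` should create them, though keeping the list is usually easier to loop over.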

edited Jan 16 '22 at 04:28

answered Jan 16 '22 at 03:19

www

I can see how this would work; it seems tedious, though. – Karl Wolfschtagg Jan 16 '22 at 20:37

score 1 · Answer 2 · answered Jan 16 '22 at 03:06

Here's a tidyverse-based solution. Try using read_csv_chunked() from readr, which reads the file in chunks and applies a callback to each one.

library(tidyverse)

# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")

# here's the solution
partial_data <- read_csv_chunked("test.csv", 
              DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
              chunk_size = 1000)

You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
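One way that wrapper might look, as a rough sketch: `read_matching()` and its arguments are hypothetical names, the practice file is written to a temporary path, and the column types are spelled out on the assumption that the file has the `string` and `value` columns from the practice data.

```r
library(readr)
library(dplyr)

# Hypothetical helper: keep only the rows whose `string` column equals `target`,
# reading the file `chunk_size` rows at a time
read_matching <- function(path, target, chunk_size = 1000) {
  read_csv_chunked(
    path,
    callback = DataFrameCallback$new(
      function(x, pos) filter(x, string == target)
    ),
    chunk_size = chunk_size,
    col_types = cols(string = col_character(), value = col_double())
  )
}

# Small practice file (smaller than the one above, to keep the sketch quick)
path <- tempfile(fileext = ".csv")
set.seed(42)
write_csv(
  tibble(string = sample(letters, 5000, replace = TRUE), value = rnorm(5000)),
  path
)

only_a <- read_matching(path, "a")
only_b <- read_matching(path, "b")
```

Each call scans the whole file once, so this suits cases where only a few subsets are needed.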

This is more or less a repeat of this question: How to read only lines that fulfil a condition from a csv into R?

Interesting - I've never seen `read_csv_chunked`. I'm going to look into it. — Karl Wolfschtagg, Jan 16 '22 at 20:38
