
I have read many questions here about memory management. I cleaned up my GB-sized data, narrowed it down to 32 MB (770K rows), and now store it on BigQuery. But when I try to turn it into a matrix with

as.data.frame(str_split_fixed(event_list$event_list, ",",
                              max(length(strsplit(event_list$event_list, ",")))))

I get this error:

Error: cannot allocate vector of size 4472.6 Gb

Is there any way to fix this problem? What am I doing wrong here? Is it storing the data on BigQuery or converting it to a matrix that increases the size?

Efe
  • Isn't it your str_split that is generating way too much data? Check if `as.data.frame(event_list)` works first – Emmanuel-Lin Dec 06 '17 at 15:51
  • What's the length of `event_list$event_list`? Because you're trying to create a data.frame with that many columns. – Nathan Werth Dec 06 '17 at 15:53
  • @Emmanuel-Lin that part of it works – Efe Dec 06 '17 at 16:01
  • @NathanWerth 700 thousand rows and 10 columns – Efe Dec 06 '17 at 16:01
  • Maybe I'm missing something here, but it seems like you have a lot of redundancy going on. Not sure why you are calling `max(length(etc.` as `length` returns a single integer. I think maybe you were going for `lengths` (with the "s"). Also, `strsplit` and `str_split_fixed` are very similar; you could theoretically get the results of `str_split_fixed` from `strsplit`. If you have to compute something more than once, especially an expensive operation, compute it only once and store it in a variable. Continued... – Joseph Wood Dec 06 '17 at 16:14
  • I would do something like so: `eventSplit <- strsplit(event_list$event_list, ","); maxLen <- max(lengths(eventSplit));` Then using any of the awesome answers found in [How to convert a list consisting of vector of different lengths to a usable data frame in R?](https://stackoverflow.com/q/15201305/4408538), we can get our `data.frame`. I'll use the `plyr` one here as it is super short (not sure if it is the fastest): `plyr::ldply(eventSplit, rbind)` and voila! (A runnable sketch follows these comments.) – Joseph Wood Dec 06 '17 at 16:26
  • And now that I think about it, I bet your problem is the call to `length` instead of `lengths`. If your list is really large, this would cause `str_split_fixed` to make a huge matrix with every row the length of your list. – Joseph Wood Dec 06 '17 at 16:28
  • @JosephWood very valid points, I will try your suggestions now! – Efe Dec 06 '17 at 17:11
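
A runnable sketch of Joseph Wood's plyr suggestion from the comments, using a made-up three-row event_list in place of the real BigQuery data:

event_list <- data.frame(event_list = c("a,b,c", "d,e", "f"),
                         stringsAsFactors = FALSE)  # hypothetical stand-in data

eventSplit <- strsplit(event_list$event_list, ",")  # split once and store the result
maxLen <- max(lengths(eventSplit))                  # lengths (plural), not length
DF <- plyr::ldply(eventSplit, rbind)                # bind rows, padding short ones with NA
DF
#   1    2    3
# 1 a    b    c
# 2 d    e    <NA>
# 3 f    <NA> <NA>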

1 Answer


@JosephWood nailed it. If event_list has 700,000 rows, then you're trying to create a data.frame with 700,000 rows and 700,000 columns. strsplit(event_list$event_list, ",") would be a list of length 700,000, so length(strsplit(event_list$event_list, ",")) gives a single number: 700,000. The max of one number is just that number. You should use lengths instead.
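
To see the difference on a small made-up list:

x <- strsplit(c("a,b,c", "d,e"), ",")
length(x)        # 2 -- one number: how many elements the list has
lengths(x)       # 3 2 -- how many values are inside each element
max(lengths(x))  # 3 -- the widest row, which is what n should be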

So your call to str_split_fixed ends up acting like this:

str_split_fixed(event_list$event_list, ",", n = 700000)

That gives a character matrix with 700,000 rows (the length of event_list$event_list) and 700,000 columns (n), with the missing entries padded by empty strings.
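
A scaled-down example with made-up input shows how n alone drives the width:

library(stringr)
m <- str_split_fixed(c("a,b", "c"), ",", n = 5)
dim(m)  # 2 5 -- every input string gets n columns, padded with ""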

On my machine, I roughly estimated the necessary memory:

format(700000 * object.size(character(700000)), "GB")
# [1] "3650.8 Gb"

That's not counting any extra memory required to store those vectors in a data.frame.

The solution:

split_values <- strsplit(event_list$event_list, ",")                 # split each row once
value_counts <- lengths(split_values)                                # number of values per row
extra_blanks <- lapply(max(value_counts) - value_counts, character)  # "" padding needed per row
values_with_blanks <- mapply(split_values, extra_blanks, FUN = c, SIMPLIFY = FALSE)
DF <- as.data.frame(do.call(rbind, values_with_blanks), stringsAsFactors = FALSE)  # one row per original string
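
For instance, with a made-up three-row input:

event_list <- data.frame(event_list = c("a,b,c", "d,e", "f"),
                         stringsAsFactors = FALSE)
# after running the five steps above:
DF
#   V1 V2 V3
# 1  a  b  c
# 2  d  e
# 3  f

Padding each split vector with empty strings up to max(value_counts) makes every row the same length, so the rows bind into a 700,000-by-10 structure instead of anything close to 700,000 columns.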
Nathan Werth