My data looks like this:

View(df)

row    Events 
1       237,2,236,102,106,111,114,115,116,117,118,119,125
2       237,111,116
3       102,106,111,114,115

I have around 3.5 million rows, and I want to create new binary columns, one per event code, like this:

row  237  2    236  102  106  111  114  115  116  117  118  119  125  126
1    1    1    1    1    1    1    1    1    1    1    1    1    1    0
2    1    0    0    0    0    1    0    0    1    0    0    0    0    0
3    0    0    0    1    1    1    1    1    0    0    0    0    0    0

I used the solution from "Create new columns with dummies based on values", which is:

Event  <- as.data.frame.matrix(table(stack(setNames(strsplit(df$event, ","), df$row))[2:1]))

It worked on a small data set, but with the 3.5 million rows I get this error:

Error in table(stack(setNames(strsplit(data$event, ","), data$row))[2:1]) :
attempt to make a table with >= 2^31 elements

I think the error occurs because the table I'm building is too big, but I really need those columns. How can I fix this?
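For illustration, here is a minimal sketch of one possible way around the limit: store only the 1s in a sparse matrix instead of building a dense table. This swaps in `sparseMatrix()` from the Matrix package in place of the `table(stack(...))` approach above, so it is a different technique, not the linked answer's method. The toy `df` and the assumption that `event` is a character column with unique `row` values are illustrative only.

```r
library(Matrix)  # recommended package that ships with standard R installations

# Toy data in the shape shown above (assumption: `event` is a character column).
df <- data.frame(
  row   = 1:3,
  event = c("237,2,236,102,106,111,114,115,116,117,118,119,125",
            "237,111,116",
            "102,106,111,114,115"),
  stringsAsFactors = FALSE
)

# One character vector of codes per row; drop repeats so each cell is 0/1.
parts <- lapply(strsplit(df$event, ",", fixed = TRUE), unique)

codes <- unique(unlist(parts))                  # distinct event codes, in order of first appearance
i     <- rep(seq_along(parts), lengths(parts))  # row position of every (row, code) pair
j     <- match(unlist(parts), codes)            # column position of every (row, code) pair

# With no `x` argument, sparseMatrix() returns a pattern (TRUE/FALSE) matrix.
# Only the TRUE cells are stored, so memory scales with the number of pairs,
# not with rows * columns.
m <- sparseMatrix(i = i, j = j,
                  dims     = c(length(parts), length(codes)),
                  dimnames = list(as.character(df$row), codes))

print(dim(m))
print(m["2", c("237", "2", "111")])  # membership of row 2 in three of the columns
```

Whether this is usable depends on what happens next: many modelling functions accept sparse matrices directly, but converting `m` back to an ordinary data frame would recreate the size problem.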

Brian Tompsett - 汤莱恩
henktimmer
  • You could split it into multiple tables. – LAP Jan 17 '18 at 09:39
  • Can you try without the `table(stack` and get the frequency on the individual list elements, i.e. `lst <- lapply(strsplit(df$event, ","), table)`? Or, as @LAP mentioned, split the dataset into smaller chunks and then apply your code to each one. – akrun Jan 17 '18 at 09:39
  • I have to run my script again, @akrun. I'll give you an update ASAP (sorry, it is a long and slow process). – henktimmer Jan 17 '18 at 09:52
  • In that case, try to run the first part and store it in a list, i.e. `lst <- strsplit(df$event, ",")` – akrun Jan 17 '18 at 09:53
  • The first code, `lst <- lapply(strsplit(data$event, ","), table)`, didn't give an error, but I have no idea whether it worked. So now I'm trying to convert the result with `as.data.frame(lst)` to see if it worked. – henktimmer Jan 17 '18 at 10:28
  • It looks like `lst <- strsplit(data$event, ",")` worked (I get a list as the result). But now I'm trying to transform it into a data frame (to work with the data), and unfortunately my computer runs out of memory: when running `as.data.frame(lst)`, my memory usage increases from 60% to 100%. Maybe this is another question, but do you know of a more efficient way to deal with this (that is, getting the data back into a data frame / table)? – henktimmer Jan 17 '18 at 10:53
  • Oh, and my method didn't work: > as.data.frame(lst) Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 22, 4, 1, 25, 28, 31, 30, 2, 24, 35, 29, 23, 6, 19, 20, 27, 21, 0, 5, 7, 26, 32, 36, 34, 33, 12, 3, 1 – henktimmer Jan 17 '18 at 10:54
  • @LAP can you give some advice on how I can split this table into multiple tables? – henktimmer Jan 17 '18 at 13:09
  • Do you know how many unique values you have in your `Events` within all 3.5 million rows? – LAP Jan 17 '18 at 13:44
  • 19820 unique rows – henktimmer Jan 17 '18 at 16:14
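
The numbers in this thread can be checked with a short sketch, which also illustrates the chunking idea from the comments. It assumes the 19820 figure refers to distinct event codes (the comment says "rows", so this is a guess), and `chunk_tables()` is a hypothetical helper written for this example, not an existing function. The toy `df` mirrors the sample in the question and assumes unique `row` values.

```r
n_rows  <- 3.5e6   # approximate row count from the question
n_codes <- 19820   # figure from the comment above, ASSUMED to be distinct event codes

cells <- n_rows * n_codes   # cells in one dense rows-by-codes table
limit <- 2^31               # table() refuses to build a table this large or larger
print(cells >= limit)

# Largest number of rows per chunk that keeps one dense table under the limit.
max_rows_per_chunk <- floor((limit - 1) / n_codes)
print(max_rows_per_chunk)

# Hypothetical helper: build one dense 0/1-count table per block of rows.
# Fixing the factor levels gives every chunk the same columns in the same order.
chunk_tables <- function(df, chunk_size) {
  all_codes <- unique(unlist(strsplit(df$event, ",", fixed = TRUE)))
  positions <- seq_len(nrow(df))
  groups    <- split(positions, ceiling(positions / chunk_size))
  lapply(groups, function(idx) {
    parts <- strsplit(df$event[idx], ",", fixed = TRUE)
    long  <- data.frame(
      row  = factor(rep(df$row[idx], lengths(parts)), levels = df$row[idx]),
      code = factor(unlist(parts), levels = all_codes)
    )
    as.data.frame.matrix(table(long))
  })
}

# Toy data in the shape shown in the question.
df <- data.frame(
  row   = 1:3,
  event = c("237,2,236,102,106,111,114,115,116,117,118,119,125",
            "237,111,116",
            "102,106,111,114,115"),
  stringsAsFactors = FALSE
)

tabs <- chunk_tables(df, chunk_size = 2)
print(length(tabs))
print(tabs[[2]])
```

Under the same assumption, chunking only avoids the `table()` limit: the chunks together still hold `n_rows * n_codes` cells, which at 4 bytes per integer is on the order of hundreds of gigabytes. Each chunk would therefore have to be processed or written to disk before the next one is built, rather than all being kept in memory.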

0 Answers