Create individual rows based on sum value for fake dataset

Question

I am creating a fake dataset, and would like to essentially disaggregate a sum to create dummy rows that I can populate with random dates.

For example, my df might look like this:

id    orders   skips
joe   3        0
mary  2        1
jack  5        1

What I want to produce is a data.frame or data.table that looks like this, where a successful order is 1 and a skip is 0:

id    order
joe   1
joe   1
joe   1
mary  1
mary  0
mary  1
jack  1
jack  1
jack  1
jack  1
jack  0
jack  1

ADDITION: Ideally, the 0 values would be randomly mixed/sandwiched between 1 values if possible. This is due to a quirk of what the dataset will be used for in a problem set.

In a perfect world, I'd then assign a random start_date from a given range to each order within id, such that:

id    order  date
joe   1     1/2/2016
joe   1     1/3/2016
joe   1     1/8/2016
mary  1     1/10/2016
mary  0     1/3/2016
mary  1     1/5/2016
jack  1     1/7/2016
jack  1     1/2/2016
jack  1     1/1/2016
jack  1     1/10/2016
jack  0     1/12/2016
jack  1     1/15/2016

I initially thought that I could use a combination of dcast and reshape to trick R into making the dataset, e.g. dcast(df, id ~ orders, fun.aggregate = length), but this took me down the wrong path.

But one must crawl before they walk. Is anyone able to help?

@josliber I've added a few of my ideas (`dcast` and `reshape`) but didn't want to send anyone down a rabbit hole that I knew to be wrong. Hopefully this helps! — roody, Feb 29 '16 at 00:14
`x <- Vectorize(rep)(setNames(rep(1:0, nrow(df)), rep(df[, 1], each = 2)), (t(df[, -1]))); data.frame(id = names(x), order = x)` — rawr, Feb 29 '16 at 00:23

score 2 · Accepted Answer · answered Feb 29 '16 at 00:32


Here's one approach with data.table:

dt[, .(order = rep(c(1, 0), c(orders, skips))), by = "id"]
#      id order
#1:   joe     1
#2:   joe     1
#3:   joe     1
#4:  mary     1
#5:  mary     1
#6:  mary     0
#7:  jack     1
#8:  jack     1
#9:  jack     1
#10: jack     1
#11: jack     1
#12: jack     0

Data:

library(data.table)
dt <- fread(
  "id    orders   skips
  joe   3        0
  mary  2        1
  jack  5        1"
)
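
The question's follow-up asks for the 0s to be mixed in among the 1s and for a random date on each row. A minimal sketch of one way to extend the approach above, assuming a hypothetical date range of January 2016 (the question does not give one) and a small helper named shuffle that is not part of data.table:

```r
library(data.table)

dt <- data.table(id     = c("joe", "mary", "jack"),
                 orders = c(3L, 2L, 5L),
                 skips  = c(0L, 1L, 1L))

set.seed(42)  # only so the sketch is reproducible

# Assumed date range; replace with whatever range the problem set needs
date_pool <- seq(as.Date("2016-01-01"), as.Date("2016-01-31"), by = "day")

# Hypothetical helper: permute a vector by index, which avoids the
# special case where sample() treats a single number n as 1:n
shuffle <- function(x) x[sample.int(length(x))]

# One row per order or skip, with 1s and 0s in random order within each id
res <- dt[, .(order = shuffle(rep(c(1, 0), c(orders, skips)))), by = id]

# Draw distinct random dates within each id (sampling without replacement)
res[, date := sample(date_pool, .N), by = id]
```

Shuffling places the 0s at random positions, so a 0 can still land on the first or last row of an id; forcing it into a middle row would need an extra rule. Sampling without replacement also requires the date pool to hold at least as many days as the largest group has rows.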

nrussell


Not in my question now (I'll go back and edit), but do you have any thoughts on how I could make it so that the 0's are in middle rows when possible? e.g., row #6 would be row #5. This is a quirk of the problem that the fake dataset will be used for. – roody Feb 29 '16 at 01:22

score 0 · Answer 2 · answered Feb 29 '16 at 01:04

You can do it in base R using split and lapply, then rbinding everything back together:

df2 <- do.call(rbind, lapply(split(df, df$id),
                             function(x){
                                 data.frame(id = rep(x$id, sum(x$orders, x$skips)),
                                            order = sample(rep(c(1, 0), c(x$orders, x$skips)))
                                 )
                             }))
rownames(df2) <- NULL

where split breaks df into one data.frame per id, lapply runs the anonymous function on each piece, and do.call(rbind, ...) stacks the resulting list back into a single data.frame. The anonymous function makes a data.frame by repeating id the necessary number of times and using sample to shuffle a vector of 1s and 0s, repeated orders and skips times, respectively. (Passing df straight to tapply can fail with "arguments must have same length", because tapply may compare the index against the number of columns; it would run here only because df has three columns and three rows.)

One hiccup, which should be fixable: rbind automatically creates row names, which are ugly and unnecessary. There is an argument to turn this off, but I can't get it arranged in the do.call structure properly, so the above just erases them in a second line. (If you know the right place to stick make.row.names = FALSE, comment and I'll edit.)
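
For the row-name question, one arrangement that should work is to append the extra argument to the list that do.call passes to rbind. This is a sketch that rebuilds the example data and uses split and lapply; the name pieces is only illustrative, and sample is left out so the focus stays on the row names:

```r
df <- data.frame(id     = c("joe", "mary", "jack"),
                 orders = c(3, 2, 5),
                 skips  = c(0, 1, 1))

# One small data.frame per id, held in a named list
pieces <- lapply(split(df, df$id), function(x) {
  data.frame(id    = rep(x$id, x$orders + x$skips),
             order = rep(c(1, 0), c(x$orders, x$skips)))
})

# do.call passes every list element as an argument to rbind, so a named
# element make.row.names = FALSE is matched to that argument of the
# data.frame method of rbind
df2 <- do.call(rbind, c(pieces, make.row.names = FALSE))
```

With this, df2 gets plain sequential row names and the separate rownames(df2) <- NULL line is not needed.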

The result:

> df2
     id order
1  jack     0
2  jack     1
3  jack     1
4  jack     1
5  jack     1
6  jack     1
7   joe     1
8   joe     1
9   joe     1
10 mary     1
11 mary     0
12 mary     1
