1

I have a dataset looks like this:

(Visualising the datasets below may help you to understand the question)

original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)

Values in variable A, B, and b are all integers. Variable b is incomplete in this dataset and it actually has values from 1 to the value of B.

I need to complete this dataset with a new variable a added, the completed dataset will look like this:

completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)

Values in variable a are integers too and a has values from 1 to the value of A. Values in b are nested in each value of a, and values in a are nested in each factor of ID.

I think the most relevant functions to complete a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables, they cannot add a new column(variable).

I know the challenge is that there are multiple locations to allocate values in detail correspondingly to values in the newly added a through the nested relationship, for example, the completed dataset can also be this:

completed2 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)

Where the values in detail got located in the completed dataset does not matter to me. My actual dataset has more than 40,000 rows, so I really need something to automate it.

Is it possible to do this? Thanks very much!!!

Juan_814
  • 51
  • 8

3 Answers3

1

It's pretty messy using for loop, and it will give very random position of GOOOOD

comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)

original <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n())

vec <- rep(NA_character_, nrow(comp_dummy))

for (i in 1:nrow(original)){
  x <- original[i,]
  
  y <- comp_dummy %>%
    rownames_to_column(., "row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b)  %>%
    pull(row)
  z <- sample(y, x$n, replace = FALSE)  %>% as.numeric()
  print(z)
  vec[{z}] <- "GOOOOD"
}

comp_dummy$detail <- vec
comp_dummy

   ID        A     a     B     b detail
   <chr> <dbl> <int> <dbl> <int> <chr> 
 1 John      3     1     4     1 NA    
 2 John      3     1     4     2 GOOOOD
 3 John      3     1     4     3 NA    
 4 John      3     1     4     4 NA    
 5 John      3     2     4     1 NA    
 6 John      3     2     4     2 NA    
 7 John      3     2     4     3 NA    
 8 John      3     2     4     4 NA    
 9 John      3     3     4     1 NA    
10 John      3     3     4     2 GOOOOD
11 John      3     3     4     3 GOOOOD
12 John      3     3     4     4 NA    
13 Steve     1     1     2     1 NA    
14 Steve     1     1     2     2 GOOOOD
Park
  • 14,771
  • 6
  • 10
  • 29
  • This indeed does what I asked in the question, so I accepted this answer. In fact, I oversimplified my dataset and the loop does not work on my actual dataset (an error appeared when the printed values reached 3,000,000 something, I think the actual dataset is too big). I edited the question and reposted it, see if you're interested: https://stackoverflow.com/questions/69732688/r-complete-a-dataset-with-a-new-column-added – Juan_814 Oct 27 '21 at 04:16
  • @JuanW814 I take a look at it and add an answer. – Park Oct 27 '21 at 04:38
0

I wonder whether doing the complete twice, first for the a and then for b can be a solution. You can adjust different nesting, or group_by if needed.

Depending if the maximum a shall be from A within the ID group or not you shall adjust/remove the group_by (similar for b within the a group)

library(dplyr)
library(tidyr)

original %>%
  dplyr::mutate(a = 1) %>%
  dplyr::group_by( ID ) %>%
  tidyr::complete( a = 1:max(A), nesting(ID, A, B, b), fill = list( detail = NA_character_)) %>%
  group_by( a ) %>%
  tidyr::complete( b = 1:max(B), nesting(ID, A, B, a), fill = list( detail = NA_character_)) %>%
  dplyr::ungroup()
Volodymyr
  • 888
  • 10
  • 21
  • Thanks but this will increase the levels in `b` under `Steve` from 2 to 4, which cannot happen. Actually I oversimplified the dataset, so I edited the question and posted it again, see if you're interested: https://stackoverflow.com/questions/69732688/r-complete-a-dataset-with-a-new-column-added – Juan_814 Oct 27 '21 at 04:26
  • Then just do `group_by( ID, a )` and remove `ID` from the second `nesting`. With `group_by` you can specify the nesting pattern. Then `Steve` will be only twice. – Volodymyr Oct 28 '21 at 06:26
0

A base R solution

do.call(
  rbind,
  by(original,list(original$ID),function(x){
    tmp=merge(
      unique(x),
      setNames(
        expand.grid(
          unique(x$ID),
          x$A[1],
          1:max(x$A),
          x$B[1],
          1:max(x$B)
        ),
        c("ID","A","a","B","b")
      ),
      by=c("ID","A","B","b"),
      all=T
    )
    tmp[order(tmp$a,tmp$b),c("ID","A","a","B","b","detail")]
  })
)

resulting in

           ID A a B b detail
John.1   John 3 1 4 1   <NA>
John.5   John 3 1 4 2 GOOOOD
John.8   John 3 1 4 3 GOOOOD
John.11  John 3 1 4 4   <NA>
John.2   John 3 2 4 1   <NA>
John.4   John 3 2 4 2 GOOOOD
John.9   John 3 2 4 3 GOOOOD
John.12  John 3 2 4 4   <NA>
John.3   John 3 3 4 1   <NA>
John.6   John 3 3 4 2 GOOOOD
John.7   John 3 3 4 3 GOOOOD
John.10  John 3 3 4 4   <NA>
Steve.1 Steve 1 1 2 1   <NA>
Steve.2 Steve 1 1 2 2 GOOOOD
user2974951
  • 9,535
  • 1
  • 17
  • 24
  • Thanks but I can't duplicate values in `detail`! In fact, I oversimplified my dataset in this question, so I edited the question and posted again, see if you're interested: https://stackoverflow.com/questions/69732688/r-complete-a-dataset-with-a-new-column-added – Juan_814 Oct 27 '21 at 04:23