0

I have a dataset looks like this:

(Visualising the datasets below may help you to understand the question)

original <- data.frame(
  ID = c(rep("John", 4), "Steve"),
  A = as.integer(c(rep(3, 4), 1)),
  b = c(2, 3, 4, 2, 2),
  B = c(rep(4, 4), 2),
  detail1 = c("Yes", "Sure", "No", "Yes", "Yes"),
  detail2 = c(rep("Unique1", 4), "Unique2")
)

Values in A, B, and b are all integers. Variable b is incomplete in this dataset and it actually has values from 1 to the value of B.

I need to complete this dataset with a new variable a added, the completed dataset will look like this:

completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  A = c(rep(3, 12), rep(1, 2)),
  b = c(rep(1:4, 3), 1, 2),
  B = c(rep(4, 12), rep(2, 2)),
  detail1 = c("Absence", "Yes", "Sure", "No", "Absence", "Yes", rep("Absence", 7), "Yes"),
  detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)

Values in a are integers too and a has values from 1 to the value of A. Note that b is nested in a.

I think the most relevant functions to complete a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables, they cannot add a new column(variable).

I know the challenge is that there are multiple locations to allocate values in detail1 correspondingly to values in the newly added a through the nested relationship, for example, the completed dataset can also be this:

completed2 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  A = c(rep(3, 12), rep(1, 2)),
  b = c(rep(1:4, 3), 1, 2),
  B = c(rep(4, 12), rep(2, 2)),
  detail1 = c("Absence", "Yes", rep("Absence", 4), "Sure", "Absence", "Absence", "Yes", "Absence", "No", "Absence", "Yes"),
  detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)

I would like to complete the dataset following the logic of completed1 above: values in detail1 go to the smallest value in a first, and if repeated values in b present (for example, the Yes in b under John in original dataset), the repeated value goes to the next value in a.

Is it possible to do this?

My actual dataset has more variables than this example and the completed dataset will have more than 700,000 rows, so I prefer fast methods to automate it.

Thanks very much!!!

Juan_814
  • 51
  • 8
  • I think adding `a` is the easy part, `original %>% mutate(a = lapply(A, seq), b = lapply(B, seq)) %>% tidyr::unnest(a)` should do it just fine. I can think of a couple ways to do `b`, but they don't seem very good. I'll sleep on it and see if anything comes. – Gregor Thomas Oct 27 '21 at 04:27
  • @GregorThomas much appreciated! – Juan_814 Oct 27 '21 at 04:35

1 Answers1

1

Please let me know if it's not following your purpose.

comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)


dummy <- original %>%
  group_by(ID, A,b,B, detail1) %>%
  mutate(a = 1:n())

  ID        A     b     B detail1 detail2     a
  <chr> <int> <dbl> <dbl> <chr>   <chr>   <int>
1 John      3     2     4 Yes     Unique1     1
2 John      3     3     4 Sure    Unique1     1
3 John      3     4     4 No      Unique1     1
4 John      3     2     4 Yes     Unique1     2
5 Steve     1     2     2 Yes     Unique2     1

comp_dummy %>%
  full_join(dummy, by = c("ID","A","a","B","b")) %>%
  group_by(ID) %>%
  mutate(detail2 = unique(detail2[!is.na(detail2)]),
         detail1 = replace_na(detail1, "Absence")) 

   ID        A     a     B     b detail1 detail2
   <chr> <int> <int> <dbl> <dbl> <chr>   <chr>  
 1 John      3     1     4     1 Absence Unique1
 2 John      3     1     4     2 Yes     Unique1
 3 John      3     1     4     3 Sure    Unique1
 4 John      3     1     4     4 No      Unique1
 5 John      3     2     4     1 Absence Unique1
 6 John      3     2     4     2 Yes     Unique1
 7 John      3     2     4     3 Absence Unique1
 8 John      3     2     4     4 Absence Unique1
 9 John      3     3     4     1 Absence Unique1
10 John      3     3     4     2 Absence Unique1
11 John      3     3     4     3 Absence Unique1
12 John      3     3     4     4 Absence Unique1
13 Steve     1     1     2     1 Absence Unique2
14 Steve     1     1     2     2 Yes     Unique2
Park
  • 14,771
  • 6
  • 10
  • 29