I have a dataset looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
ID = c(rep("John", 3), "Steve"),
A = c(rep(3, 3), 1),
B = c(rep(4, 3), 2),
b = c(2, 3, 2, 2),
detail = c(rep("GOOOOD", 4))
)
Values in variable A
, B
, and b
are all integers. Variable b
is incomplete in this dataset and it actually has values from 1 to the value of B
.
I need to complete this dataset with a new variable a
added, the completed dataset will look like this:
completed1 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)
Values in variable a
are integers too and a
has values from 1 to the value of A
. Values in b
are nested in each value of a
, and values in a
are nested in each factor of ID
.
I think the most relevant functions to complete a dataset in this way are tidyr::complete()
and tidyr::expand()
, but they can only complete combinations of values in existing variables, they cannot add a new column(variable).
I know the challenge is that there are multiple locations to allocate values in detail
correspondingly to values in the newly added a
through the nested relationship, for example, the completed dataset can also be this:
completed2 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)
Where the values in detail
got located in the completed dataset does not matter to me. My actual dataset has more than 40,000 rows, so I really need something to automate it.
Is it possible to do this? Thanks very much!!!