I have a dataset looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
ID = c(rep("John", 4), "Steve"),
A = as.integer(c(rep(3, 4), 1)),
b = c(2, 3, 4, 2, 2),
B = c(rep(4, 4), 2),
detail1 = c("Yes", "Sure", "No", "Yes", "Yes"),
detail2 = c(rep("Unique1", 4), "Unique2")
)
Values in A
, B
, and b
are all integers. Variable b
is incomplete in this dataset and it actually has values from 1 to the value of B
.
I need to complete this dataset with a new variable a
added, the completed dataset will look like this:
completed1 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
A = c(rep(3, 12), rep(1, 2)),
b = c(rep(1:4, 3), 1, 2),
B = c(rep(4, 12), rep(2, 2)),
detail1 = c("Absence", "Yes", "Sure", "No", "Absence", "Yes", rep("Absence", 7), "Yes"),
detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)
Values in a
are integers too and a
has values from 1 to the value of A
. Note that b
is nested in a
.
I think the most relevant functions to complete a dataset in this way are tidyr::complete()
and tidyr::expand()
, but they can only complete combinations of values in existing variables, they cannot add a new column(variable).
I know the challenge is that there are multiple locations to allocate values in detail1
correspondingly to values in the newly added a
through the nested relationship, for example, the completed dataset can also be this:
completed2 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
A = c(rep(3, 12), rep(1, 2)),
b = c(rep(1:4, 3), 1, 2),
B = c(rep(4, 12), rep(2, 2)),
detail1 = c("Absence", "Yes", rep("Absence", 4), "Sure", "Absence", "Absence", "Yes", "Absence", "No", "Absence", "Yes"),
detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)
I would like to complete the dataset following the logic of completed1
above: values in detail1
go to the smallest value in a
first, and if repeated values in b
present (for example, the Yes
in b
under John
in original
dataset), the repeated value goes to the next value in a
.
Is it possible to do this?
My actual dataset has more variables than this example and the completed dataset will have more than 700,000 rows, so I prefer fast methods to automate it.
Thanks very much!!!