Up front, the two methods below are completely different, not equivalents in "base R vs dplyr". I'm sure either can be translated to the other.
dplyr
The premise here is to first reshape/pivot the data longer so that each Drug/Dose is on its own line, renumber them appropriately, and then bring it back to a wide state.
NOTE: frankly, I usually prefer to deal with data in a long format, so consider keeping it in its state immediately before pivot_wider
. This means you'd need to bring Age
and Place
back into it somehow.
Why? A long format deals very well with many types of aggregation; ggplot2
really really prefers data in the long format; I dislike seeing and having to deal with all of the NA
/empty values that will invariably happen with this wide format, since many PIDs don't have (e.g.) Drug6
or later. This seems subjective, but it can really be an objective change/improvement to data-mangling, depending on your workflow.
library(dplyr)
# library(tidyr) # pivot_longer, pivot_wider
dat0 <- select(dat, PID, Date, Age, Place) %>%
group_by(PID, Date) %>%
summarize(across(everything(), ~ .[!is.na(.) & nzchar(trimws(.))][1] ))
dat %>%
select(-Age, -Place) %>%
tidyr::pivot_longer(
-c(Row, PID, Date),
names_to = c(".value", "iter"),
names_pattern = "^([^0-9]+)([123]?)$") %>%
arrange(Row, iter) %>%
group_by(PID, Date) %>%
mutate(iter = row_number()) %>%
select(-Row) %>%
tidyr::pivot_wider(
c("PID", "Date"), names_sep = "",
names_from = "iter", values_from = c("Drug", "Dose")) %>%
left_join(dat0, by = c("PID", "Date"))
# # A tibble: 5 x 16
# # Groups: PID, Date [5]
# PID Date Drug1 Drug2 Drug3 Drug4 Drug5 Drug6 Dose1 Dose2 Dose3 Dose4 Dose5 Dose6 Age Place
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
# 1 11A 25/10/2021 RPG NAT QRT BET "SET" "BLT" 12 34 5 10 43 45 45 PMk
# 2 12B 20/10/2021 ATY LTP CRT <NA> <NA> <NA> 13 3 3 NA NA NA 56 GTL
# 3 13A 22/10/2021 GGS GSF ERE DFS "" "" 7 12 45 5 NA NA 45 RKS
# 4 13A 26/10/2021 BRT ARR GSF <NA> <NA> <NA> 9 4 34 NA NA NA 46 GLO
# 5 14B 04/08/2021 GDS TRE HHS <NA> <NA> <NA> 2 55 34 NA NA NA 25 MTK
Notes:
- I broke out
dat0
early, since Age
and Place
don't really fit into the pivot/renumber/pivot mindset.
base R
Here's a base R method that splits (according to your grouping criteria: PID
and Date
), finds the Drug/Dose columns that need to be renumbered, renames them, and the merge
s all of the frames back together.
spl <- split(dat, ave(rep(1L, nrow(dat)), dat[,c("PID", "Date")], FUN = seq_along))
spl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 2 2 11A 25/10/2021 BET 10 SET 43 BLT 45 NA
# 6 6 13A 22/10/2021 DFS 5 NA NA NA
nms <- lapply(spl, function(x) grep("^(Drug|Dose)", colnames(x), value = TRUE))
nms <- data.frame(i = rep(names(nms), lengths(nms)), oldnm = unlist(nms))
nms$grp <- gsub("[0-9]+$", "", nms$oldnm)
nms$newnm <- paste0(nms$grp, ave(nms$grp, nms$grp, FUN = seq_along))
nms <- split(nms, nms$i)
newspl <- Map(function(x, nm) {
colnames(x)[ match(nm$oldnm, colnames(x)) ] <- nm$newnm
x
}, spl, nms)
newspl[-1] <- lapply(newspl[-1], function(x) x[, c("PID", "Date", grep("^(Drug|Dose)", colnames(x), value = TRUE)), drop = FALSE ])
newspl
# $`1`
# Row PID Date Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place
# 1 1 11A 25/10/2021 RPG 12 NAT 34 QRT 5 45 PMk
# 3 3 12B 20/10/2021 ATY 13 LTP 3 CRT 3 56 GTL
# 4 4 13A 22/10/2021 GGS 7 GSF 12 ERE 45 45 RKS
# 5 5 13A 26/10/2021 BRT 9 ARR 4 GSF 34 46 GLO
# 7 7 14B 04/08/2021 GDS 2 TRE 55 HHS 34 25 MTK
# $`2`
# PID Date Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 2 11A 25/10/2021 BET 10 SET 43 BLT 45
# 6 13A 22/10/2021 DFS 5 NA NA
Reduce(function(a, b) merge(a, b, by = c("PID", "Date"), all = TRUE), newspl)
# PID Date Row Drug1 Dose1 Drug2 Dose2 Drug3 Dose3 Age Place Drug4 Dose4 Drug5 Dose5 Drug6 Dose6
# 1 11A 25/10/2021 1 RPG 12 NAT 34 QRT 5 45 PMk BET 10 SET 43 BLT 45
# 2 12B 20/10/2021 3 ATY 13 LTP 3 CRT 3 56 GTL <NA> NA <NA> NA <NA> NA
# 3 13A 22/10/2021 4 GGS 7 GSF 12 ERE 45 45 RKS DFS 5 NA NA
# 4 13A 26/10/2021 5 BRT 9 ARR 4 GSF 34 46 GLO <NA> NA <NA> NA <NA> NA
# 5 14B 04/08/2021 7 GDS 2 TRE 55 HHS 34 25 MTK <NA> NA <NA> NA <NA> NA
Notes:
The underlying premise of this is that you want to merge the rows onto previous rows. This means (to me) using base::merge
or dplyr::full_join
; two good links for understanding these concepts, in case you are not aware: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
To do that, we need to determine which rows are duplicates of previous; further, we need to know how many previous same-key rows there are. There are a few ways to do this, but I think the easiest is with base::split
. In this case, no PID/Date combination has more than two rows, but if you had one combination that mandated a third row, spl
would be length-3, and the resulting names would go out to Drug9
/Dose9
.
The second portion (nms <- ...
) is where we work on the names. The first few steps create a nms
dataframe that we'll use to map from old to new names. Since we're concerned about contiguous numbering through all multi-row groups, we aggregate on the base (number removed) of the Drug/Dose names, so that we number all Drug
columns from Drug1
through how many there are.
Note: this assumes that there are always perfect pairs of Drug#
/Dose#
; if there is ever a mismatch, then the numbering will be suspect.
We end with nms
being a split dataframe, just like spl
of the data. This is useful and important, since we'll Map
(zip-like lapply
) them together.
The third block updates spl
with the new names. The result in newspl
is just renaming of the columns so that when we merge them together, no column-duplication will occur.
One additional step here is removing unrelated columns from the 2nd and subsequent frame in the list. That is, we keep Age
and Place
in the first such frame but remove it from the rest. My assumption (based on the NA
/empty nature of those fields in duplicate rows) is that we only want to keep the first row's values.
The last step is to iteratively merge
them together. The Reduce
function is nice for this.