My data is in a not-quite-wide format where most variables have their own columns but the last two (point_type
and point_value
) are long. Like this:
strain light hours point_type point_value
------ ----- ----- ---------- -----------
blank 25 17 log_od750 -0.9018948
blank 25 17 log_od750 -2.0778428
blank 25 17 log_od750 -1.6129440
blank 25 17 log_od750 -2.0076792
blank 25 17 log_od750 -2.8913722
blank 25 17 log_od750 -2.4292827
How can I extract a wide-format data frame with points of a few specific types? For example I want the columns strain
, light
, hours
, od750
, and log_od750
, and to drop any other point_type
s. I assume the easiest
way is to subset like data <- subset(data, point_type %in% c('od750', 'log_od750'))
first, then reshape?
I've tried various dcast
formulas but don't really get what I'm doing. Do I need to melt
it the rest of the way first? And if so how to I prevent duplicates from being aggregated on recasting?
I realize there are a lot of reshaping questions and answers out there! But I haven't been able to figure it out from them all morning and I'm not even sure which ones are relevant to my problem.
EDIT: better sample data.
set.seed(1)
data <- data.frame(
strain=sample(paste0('strain', 1:10), 100, replace=TRUE),
light=sample(100),
hours=sample(100),
point_type=sample(c('od750', 'log_od750', 'loess', 'loess_err',
'locfit', 'locfit_err'), 100, replace=TRUE),
point_value=sample(100)
)
(The columns I want to reshape together will have matching hours
values,
I'm just not sure how to write a sample
call that demonstrates that.)
EDIT: eipi10's answer almost works, but I think I need to additionally work around the inability to keep duplicate values in dcast
by adding an ID column, as shown in several other questions:
- Can dcast be used without an aggregate function?
- reshape2 dcast without aggregation - problems with seq
- How to use "cast" in reshape without aggregation
But after reading + trying those examples I'm still not sure how to construct that column.
EDIT: Working solution based on eipi10's answer and this one:
cast_points <- function(df, ptype_regex) {
# takes a data frame in "long-ish" format
# (identifying columns + point_type, point_value),
# and extracts a wide-format data frame
# note: no aggregation is done, which makes sense if your data contains
# lots of replicate measurements you want to keep separate
# note: this also works best when you have values for each chosen point_type
# for each combination of other columns. if not, you get lots of NAs
idcols <- colnames(df)[!colnames(df) %in% c('point_type', 'point_value')]
ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
gbdots <- lapply(c(idcols, 'point_type'), as.symbol)
dcform <- paste(paste(c(idcols, 'counter'), collapse='+'), '~', 'point_type')
df <- df[df[['point_type']] %in% ptypes,]
df <- df %>%
group_by_(.dots=gbdots) %>%
mutate(counter=1:n()) %>%
dcast(formula=dcform, value.var="point_value", drop=TRUE)
df[['counter']] <- NULL
return(df)
}
head(data)
# strain resistance light light_bin hours point_type point_value
# 1 blank <NA> 25 025-025 uE 17 od750 0.4058
# 2 blank <NA> 25 025-025 uE 17 od750 0.1252
# 3 blank <NA> 25 025-025 uE 17 od750 0.1993
# 4 blank <NA> 25 025-025 uE 17 od750 0.1343
# 5 blank <NA> 25 025-025 uE 17 od750 0.0555
# 6 blank <NA> 25 025-025 uE 17 od750 0.0881
tail(data)
# strain resistance light light_bin hours point_type point_value
# 93516 strain1 Km 25 025-025 uE 17 log_od750 -0.9670579
# 93517 strain1 Km 25 025-025 uE 17 log_od750 -1.4605870
# 93540 strain1 Km 25 025-025 uE 17 log_od750 -1.8300846
# 93542 strain1 Km 25 025-025 uE 17 log_od750 -1.1779802
# 93554 strain1 Km 25 025-025 uE 17 log_od750 -2.0448469
# 93556 strain1 Km 25 025-025 uE 17 log_od750 -1.8413700
head(cast_points(data, 'log'))
# strain resistance light light_bin hours log_od750
# 1 blank <NA> 25 025-025 uE 17 -0.9018948
# 2 blank <NA> 25 025-025 uE 17 -2.0778428
# 3 blank <NA> 25 025-025 uE 17 -1.6129440
# 4 blank <NA> 25 025-025 uE 17 -2.0076792
# 5 blank <NA> 25 025-025 uE 17 -2.8913722
# 6 blank <NA> 25 025-025 uE 17 -2.4292827
head(cast_points(data, 'od'))
# strain resistance light light_bin hours log_od750 od750
# 1 blank <NA> 25 025-025 uE 17 -0.9018948 0.4058
# 2 blank <NA> 25 025-025 uE 17 -2.0778428 0.1252
# 3 blank <NA> 25 025-025 uE 17 -1.6129440 0.1993
# 4 blank <NA> 25 025-025 uE 17 -2.0076792 0.1343
# 5 blank <NA> 25 025-025 uE 17 -2.8913722 0.0555
# 6 blank <NA> 25 025-025 uE 17 -2.4292827 0.0881
EDIT: In case anyone else got this far, here are a couple other functions for manipulating the same kind of data:
remove_points <- function(df, ptype_regex) {
# takes a data frame in "long-ish" format
# (identifying columns + point_type, point_value),
# and removes rows with point_types matching the regex
# useful for removing points that you just extracted with cast_points
ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
df[!df[['point_type']] %in% ptypes,]
}
append_points <- function(df1_longish, df2_wide) {
# takes one main data frame in "long-ish" format and a second in wide format
# melts the wide one into long-ish and appends to the main one
idcols <- colnames(df1_longish)[!colnames(df1_longish)
%in% c('point_type', 'point_value')]
df2_longish <- melt(df2, variable.name='point_type',
value.name='point_value', id.vars=idcols)
rbind(df1_longish, df2_longish)
}
sac_points <- function(df, fn, ptype_regex) {
# split-apply-combine for "long-ish" data
misc <- drop_points(df, ptype_regex)
df <- cast_points(df, ptype_regex)
df <- fn(df)
append_points(misc, df)
}