2

My data is in a not-quite-wide format where most variables have their own columns but the last two (point_type and point_value) are long. Like this:

strain light hours point_type point_value
------ ----- ----- ---------- -----------
blank  25    17    log_od750  -0.9018948
blank  25    17    log_od750  -2.0778428
blank  25    17    log_od750  -1.6129440
blank  25    17    log_od750  -2.0076792
blank  25    17    log_od750  -2.8913722
blank  25    17    log_od750  -2.4292827

How can I extract a wide-format data frame with points of a few specific types? For example I want the columns strain, light, hours, od750, and log_od750, and to drop any other point_types. I assume the easiest way is to subset like data <- subset(data, point_type %in% c('od750', 'log_od750')) first, then reshape?

I've tried various dcast formulas but don't really get what I'm doing. Do I need to melt it the rest of the way first? And if so how to I prevent duplicates from being aggregated on recasting?

I realize there are a lot of reshaping questions and answers out there! But I haven't been able to figure it out from them all morning and I'm not even sure which ones are relevant to my problem.

EDIT: better sample data.

set.seed(1)
data <- data.frame(
  strain=sample(paste0('strain', 1:10), 100, replace=TRUE),
  light=sample(100),
  hours=sample(100),
  point_type=sample(c('od750', 'log_od750', 'loess', 'loess_err',
                      'locfit', 'locfit_err'), 100, replace=TRUE),
  point_value=sample(100)
)

(The columns I want to reshape together will have matching hours values, I'm just not sure how to write a sample call that demonstrates that.)

EDIT: eipi10's answer almost works, but I think I need to additionally work around the inability to keep duplicate values in dcast by adding an ID column, as shown in several other questions:

But after reading + trying those examples I'm still not sure how to construct that column.

EDIT: Working solution based on eipi10's answer and this one:

cast_points <- function(df, ptype_regex) {
  # takes a data frame in "long-ish" format
  # (identifying columns + point_type, point_value),
  # and extracts a wide-format data frame
  # note: no aggregation is done, which makes sense if your data contains
  #       lots of replicate measurements you want to keep separate
  # note: this also works best when you have values for each chosen point_type
  #       for each combination of other columns. if not, you get lots of NAs
  idcols <- colnames(df)[!colnames(df) %in% c('point_type', 'point_value')]
  ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
  gbdots <- lapply(c(idcols, 'point_type'), as.symbol)
  dcform <- paste(paste(c(idcols, 'counter'), collapse='+'), '~', 'point_type')
  df <- df[df[['point_type']] %in% ptypes,]
  df <- df %>%
    group_by_(.dots=gbdots) %>%
    mutate(counter=1:n()) %>%
    dcast(formula=dcform, value.var="point_value", drop=TRUE)
  df[['counter']] <- NULL
  return(df)
}

head(data)
#   strain resistance light  light_bin hours point_type point_value
# 1  blank       <NA>    25 025-025 uE    17      od750      0.4058
# 2  blank       <NA>    25 025-025 uE    17      od750      0.1252
# 3  blank       <NA>    25 025-025 uE    17      od750      0.1993
# 4  blank       <NA>    25 025-025 uE    17      od750      0.1343
# 5  blank       <NA>    25 025-025 uE    17      od750      0.0555
# 6  blank       <NA>    25 025-025 uE    17      od750      0.0881
tail(data)
#        strain resistance light  light_bin hours point_type point_value
# 93516 strain1         Km    25 025-025 uE    17  log_od750  -0.9670579
# 93517 strain1         Km    25 025-025 uE    17  log_od750  -1.4605870
# 93540 strain1         Km    25 025-025 uE    17  log_od750  -1.8300846
# 93542 strain1         Km    25 025-025 uE    17  log_od750  -1.1779802
# 93554 strain1         Km    25 025-025 uE    17  log_od750  -2.0448469
# 93556 strain1         Km    25 025-025 uE    17  log_od750  -1.8413700

head(cast_points(data, 'log'))
#   strain resistance light  light_bin hours  log_od750
# 1  blank       <NA>    25 025-025 uE    17 -0.9018948
# 2  blank       <NA>    25 025-025 uE    17 -2.0778428
# 3  blank       <NA>    25 025-025 uE    17 -1.6129440
# 4  blank       <NA>    25 025-025 uE    17 -2.0076792
# 5  blank       <NA>    25 025-025 uE    17 -2.8913722
# 6  blank       <NA>    25 025-025 uE    17 -2.4292827

head(cast_points(data, 'od'))
#   strain resistance light  light_bin hours  log_od750  od750
# 1  blank       <NA>    25 025-025 uE    17 -0.9018948 0.4058
# 2  blank       <NA>    25 025-025 uE    17 -2.0778428 0.1252
# 3  blank       <NA>    25 025-025 uE    17 -1.6129440 0.1993
# 4  blank       <NA>    25 025-025 uE    17 -2.0076792 0.1343
# 5  blank       <NA>    25 025-025 uE    17 -2.8913722 0.0555
# 6  blank       <NA>    25 025-025 uE    17 -2.4292827 0.0881

EDIT: In case anyone else got this far, here are a couple other functions for manipulating the same kind of data:

remove_points <- function(df, ptype_regex) {
  # takes a data frame in "long-ish" format
  # (identifying columns + point_type, point_value),
  # and removes rows with point_types matching the regex
  # useful for removing points that you just extracted with cast_points
  ptypes <- unique(grep(ptype_regex, df[['point_type']], value=TRUE))
  df[!df[['point_type']] %in% ptypes,]
}

append_points <- function(df1_longish, df2_wide) {
  # takes one main data frame in "long-ish" format and a second in wide format
  # melts the wide one into long-ish and appends to the main one
  idcols <- colnames(df1_longish)[!colnames(df1_longish)
                                  %in% c('point_type', 'point_value')]
  df2_longish <- melt(df2, variable.name='point_type',
                              value.name='point_value', id.vars=idcols)
  rbind(df1_longish, df2_longish)
}

sac_points <- function(df, fn, ptype_regex) {
  # split-apply-combine for "long-ish" data
  misc <- drop_points(df, ptype_regex)
  df   <- cast_points(df, ptype_regex)
  df   <- fn(df)
  append_points(misc, df)
}
Community
  • 1
  • 1
jefdaj
  • 2,025
  • 2
  • 21
  • 33
  • 2
    `subset(your_data, point_type == "log_od750")`? If the columns you want are the columns you have, there's no reshaping going on, just subsetting. – Gregor Thomas Jul 23 '15 at 18:38
  • I think subsetting could work. I wanted to keep the question simple but there will actually be multiple point_type columns to keep like locfit and locfit_err at the same time. Can that be done with subsetting too? – jefdaj Jul 23 '15 at 18:47
  • The way your question is phrased, all the columns that you want are already there, which means subsetting. Reshaping long to wide means taking creating new columns based on values in existing columns (and having fewer rows). Reshaping wide to long is taking multiple columns and condensing them into two columns (adding more rows). – Gregor Thomas Jul 23 '15 at 18:51
  • You're right I'll rephrase the question. I want to reshape long to wide. – jefdaj Jul 23 '15 at 18:53

1 Answers1

1

It looks like you've got two separate issues: You want to keep only some levels of point_type, and you then want the levels of point_type to become columns. If so, you can do this:

library(reshape2)

# Subset
data.keep = data[grep("od750|loc", data$point_type), ]

# Convert to wide format
data.keep.wide = dcast(data.keep, strain + light + hours ~ point_type, 
                       value.var="point_value") 

The first line of code will keep all rows for which point_type contains the text "od750" or "loc" (I just chose these for illustration). The second line converts the remaining levels of point_type to columns.

Here the first few lines of the result:

data.keep.wide

     strain light hours locfit locfit_err log_od750 od750
1   strain1    24    48     NA         60        NA    NA
2   strain1    67    74     16         NA        NA    NA
3   strain1    78    20      7         NA        NA    NA
4   strain1    83     5     NA         NA        58    NA
5   strain1    95    84     NA         NA        NA    47
6  strain10    13    53     NA         NA        35    NA
7  strain10    42    90     NA         NA        NA    78

Note that in each row only one of the new columns contains a value. Based on the information in your question, I'm assuming that your real data has a point_value for each level of point_type for each hour.

UPDATE: Your comments seem to indicate that your data can have multiple rows for each unique combination of strain, light, hours and point_type. If that's the case, and you want to keep all of them, then you can do the following after subsetting. The code below groups the data by all the relevant columns, then adds counter, which provides a unique "ID" to each row of a given group. Then we include counter on the LHS of dcast to keep all rows, without aggregation:

library(reshape2)
library(dplyr)

# Convert to wide format, keeping all rows
data.keep.wide = data.keep %>%
  group_by(strain, light, hours, point_type) %>%
  mutate(counter=1:n()) %>%
  dcast(strain + light + hours + counter ~ point_type, value.var="point_value")
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • This is very close to what I want! Only problem, which you can't tell from the sample data very well, is that multiple values get aggregated (using `length` by default, but could switch to `mean`). But what I want is to keep all of them. – jefdaj Jul 23 '15 at 19:23
  • Ah figured it out. Just need to add an id column like `row_id` and add it to the `dcast` too, then remove afterward. – jefdaj Jul 23 '15 at 19:56
  • Nevermind, that doesn't work because it prevents combining multiple columns. – jefdaj Jul 23 '15 at 20:13
  • Can you post a sample of data that matches the structure of your real data along with an example of what you want that data to look like after you've cast it to wide format? Are you saying that your real data can have multiple rows for each combination of, say, strain, light, hours, and od750? – eipi10 Jul 23 '15 at 21:19
  • Yes that's right. Sorry it wasn't very well presented. Your solution worked anyway though with a little tweaking. I'm honestly pretty surprised reshape2 doesn't allow replicate measurements. You'd think it would come up a lot! – jefdaj Jul 23 '15 at 23:29