I'm working on automating some basic experimental analysis in R. My script is currently set up as follows, and it generates the plot shown below.

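library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)

# `experiments`, `plot`, `ylabel` and `title` are assumed to be defined earlier
data <- list()
for (experiment in experiments) {
    path <- paste('../out/', experiment, '/', plot, '.csv', sep="")
    data[[experiment]] <- read.csv(path, header=FALSE)
}

# one column of per-year means for each experiment
df <- data.frame(Year=1:40,
                 'current'=colMeans(data[['current']]),
                 'vip'=colMeans(data[['vip']]),
                 'vipbonus'=colMeans(data[['vipbonus']]))

# reshape to long format: one row per (Year, Series) pair
df <- melt(df, id.vars = 'Year', variable.name = 'Series')
plotted <- ggplot(df, aes(Year, value)) +
           geom_line(aes(colour = Series)) +
           labs(y = ylabel, title = title)

file <- paste(plot, '.png', sep="")
ggsave(filename = file, plot = plotted)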

[Image: multi-line plot of value by Year, one line per series (current, vip, vipbonus)]

While this is close to what we want the final product to look like, the series labels need to be updated. Ideally we want them to be something like "VIP, no bonus", "VIP, with bonus" and so forth, but labels like that are not valid column names in the data frame (invalid characters are automatically replaced with `.`, even with backticks). Since these experiments are a work in progress, we also know that we are going to need more series labels in the future, so we don't want to lose ggplot's ability to set the colors for us automatically.

How can I set the series labels to be appropriate for humans?

rjzii
  • Once the data are molten, you can change the names/labels as you like because they are only data then (character or factor). The restrictions you are referring to are limited to variable/column names in R. – Uwe Apr 09 '17 at 23:44
  • @UweBlock Well shortly after I posted this I came up with a way of doing it. I'd encourage you to post what you have to see if we have the same idea in mind or not. – rjzii Apr 09 '17 at 23:53
  • @rjzii Could you please add your data.frame to the question? – shiny Apr 10 '17 at 03:34
  • The issue with renaming labels is solved now. However, I believe the data preparation process could be streamlined starting with the aggregation of the raw `data`. So, will it be possible for you to [edit] your Q and add the result of `dput(data)`? Or, a selection of a few years? Thank you. – Uwe Apr 10 '17 at 06:33
  • @UweBlock Updated! The data (floating point values) is being written to a CSV file by another program and we have some control over how the directories are created (e.g., "/[experiment]/[dataset]"). We are generating the multi-line means plots, although eventually we will want to do something [like this](http://stackoverflow.com/questions/26020142/). – rjzii Apr 10 '17 at 13:45
  • Please, can you show the result of `str(data)` or give a sample of `data` using `dput()`? Thanks – Uwe Apr 10 '17 at 13:51
  • @UweBlock I'm not sure exactly what you are hoping to see? The files I'm loading are standard CSV files of floating point values of NxM size (currently 20x40, subject to change). I'm just dumping the loaded tables into the `list` for convenience purposes since I know that in the future I'm going to want to automate the entire process. – rjzii Apr 10 '17 at 15:40
  • @rjzii Thank you for the additional information. This helped to suggest a way to streamline your data preparation steps which also addresses the series labeling issue in a natural way. – Uwe Apr 11 '17 at 07:45

3 Answers

While this may not be an ideal approach, what worked for us was to update the relevant series labels after the melt command had run:

df$Series <- as.character(df$Series)
df$Series[df$Series == "current"] <- "Current"
df$Series[df$Series == "vip"] <- "VIP, no bonus"
df$Series[df$Series == "vipbonus"] <- "VIP, with bonus"

This results in plots like the following:

[Image: the same multi-line plot, with the legend entries "Current", "VIP, no bonus" and "VIP, with bonus"]
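A variation on the same idea, sketched under the assumption that the series keys are exactly `current`, `vip` and `vipbonus`: converting `Series` to a factor with explicit `levels` and `labels` renames every entry in one step and fixes the legend order, while ggplot2 still assigns the colours automatically. The vectors `series_keys` and `series_labels` and the toy data frame are illustrative names with made-up values, not part of the original script.

```r
# Toy stand-in for the molten data frame (values are made up for illustration)
df <- data.frame(
  Year   = rep(1:2, times = 3),
  Series = rep(c("current", "vip", "vipbonus"), each = 2),
  value  = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5)
)

# Hypothetical key/label pairs; append to both vectors as experiments are added
series_keys   <- c("current", "vip", "vipbonus")
series_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")

# Map each key to its label; the level order becomes the legend order
df$Series <- factor(df$Series, levels = series_keys, labels = series_labels)

print(levels(df$Series))
```

Any key missing from `series_keys` would become `NA`, so a quick `anyNA(df$Series)` check after the conversion can catch a forgotten experiment.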

rjzii
  • Well done. That's what I had in mind when posting [my comment](http://stackoverflow.com/questions/43313075/how-do-i-set-the-series-labels-in-a-multiline-ggplot2-series#comment73691091_43313075). However, I believe the overall data preparation process could be streamlined. – Uwe Apr 10 '17 at 06:27
  • @UweBlock Indeed. Given the nature of the experiments eventually I'm going to be looking at trying to automate the loading of everything based upon the directory structures. Hand coding is fine for now, but long term it's going to get painful. – rjzii Apr 10 '17 at 13:36

The OP explained that he is currently working on automating some basic experimental analysis, part of which is relabeling the series. The OP also showed some code that prepares the data to be plotted.

Based on the additional information supplied in comments, I believe the overall processing could be streamlined in a way that also addresses the series labeling issue.

Some preparations

# used for creating file paths
experiments <- c("current", "vip", "vipbonus")
# used for labeling the series
exp_labels <- c("Current", "VIP, no bonus", "VIP, with bonus")
plot <- "dataset1"   # e.g.
paths <- paste0(file.path("../out", experiments, plot), ".csv") 
paths
#[1] "../out/current/dataset1.csv"  "../out/vip/dataset1.csv"      "../out/vipbonus/dataset1.csv"

Read data

library(data.table)   #version 1.10.4 used here
# read all files into one large data.table
# add running count in column "Series" to identify the source of each row
DT <- rbindlist(lapply(paths, fread, header = FALSE), idcol = "Series")
# rename file chunks = Series, use predefined labels
DT[, Series := factor(Series, labels = exp_labels)]

Reshape and aggregate by groups

# reshape from wide to long
molten <- melt(DT, id.vars = "Series")
# compute means by Series and Year = variable
aggregated <- molten[, .(value = mean(value)), by = .(Series, variable)]
# take factor level number of "variable" as Year
aggregated[, Year := as.integer(variable)]

Note that aggregation is done in long format (after melt()) to save typing the same command for each column.
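To illustrate why aggregating in long format saves typing, here is a minimal sketch that swaps in base R's `aggregate()` for the data.table syntax and runs on made-up numbers (the data frame `wide` and its values are assumptions for illustration only). A single grouped call over the long data yields the same per-column means as `colMeans()` on the wide data, however many columns there are.

```r
# Made-up wide data: 3 runs (rows) x 2 years (columns)
wide <- data.frame(V1 = c(1, 2, 3), V2 = c(10, 20, 30))

# Reshape to long format by hand: one row per (run, year) pair
long <- data.frame(
  variable = rep(names(wide), each = nrow(wide)),
  value    = unlist(wide, use.names = FALSE)
)

# One grouped call computes every column mean
long_means <- aggregate(value ~ variable, data = long, FUN = mean)

# The wide-format equivalent, for comparison
wide_means <- colMeans(wide)
print(long_means)
```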

Create chart & save to disk

library(ggplot2)
ggplot(aggregated, aes(Year, value)) +
  geom_line(aes(colour = Series)) +
  labs(y = "ylabel", title = "title")

file = paste(plot, '.png', sep="")
ggsave(filename = file)   # by default, the last plot is saved
Uwe

You can try this, using `fct_recode()` from the forcats package:

library(tidyverse)
df <- df %>% dplyr::mutate(Series = as.character(Series),
                           Series = fct_recode(Series,
                                              "Current" = "current",
                                              "VIP, no bonus" = "vip", 
                                              "VIP, with bonus" = "vipbonus")) 
shiny
  • Do you know if `fct_recode` supports remapping the series based upon a table of values (i.e., names and captions)? – rjzii Apr 10 '17 at 15:42
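
For relabeling from a table rather than hard-coded pairs, one option that needs no extra packages is a lookup data frame combined with base R's `match()`. This is only a sketch: the `lookup` table, its column names `key` and `caption`, and the extra `pilot` series are hypothetical. (The forcats documentation also describes splicing a named character vector into `fct_recode()` with `!!!`, which may be closer to what the comment asks for.)

```r
# Hypothetical lookup table, e.g. loaded with read.csv() from a file
lookup <- data.frame(
  key     = c("current", "vip", "vipbonus"),
  caption = c("Current", "VIP, no bonus", "VIP, with bonus"),
  stringsAsFactors = FALSE
)

# Series keys as they might appear after melt(); "pilot" has no caption yet
series <- c("vip", "current", "pilot", "vipbonus")

# Look up each key; keys without a match give NA
idx <- match(series, lookup$key)

# Fall back to the raw key where no caption is defined
relabelled <- ifelse(is.na(idx), series, lookup$caption[idx])
print(relabelled)
```

Falling back to the raw key means a newly added experiment still shows up in the legend, just without a polished caption, until the table is updated.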