
I'm working with a data frame of size 2 x 400. I need to graph this (let's call it data set A) on the same graph as the main data set for my project.

All I need is the general shape of data set A's graph, i.e. I only need to see the trend.

The values in data set A span a much smaller range than those of the main graph, so data set A just looks like a horizontal line.

I decided to scale data set A by multiplying it by a factor of... I tried various values to get the optimum vertical scaling, which leads me to the problem I'm having.

While looking for the ideal multiplicative factor by trial and error, I expected data set A's graph to keep its general shape and change only in its vertical extent, i.e. the horizontal coordinates of all maxima and minima shouldn't move; only the vertical coordinates should. But that isn't what happened, and I'd like to know why.

Here's the data set A (yellow), when multiplied by factor of 3:

[image: plot with data set A (yellow) multiplied by a factor of 3]

factor of 5:

[image: plot with data set A (yellow) multiplied by a factor of 5]

The yellow dots are the geom_point and the yellow curve is the corresponding geom_smooth.

EDIT: here is my original code. I haven't had much formal training with code, so I apologize for any messiness!

library("ggplot2")
library("dplyr")

# READ IN DATA
temp_data <-read.table(col.names = "y",
  "C:/Users/Ben/Documents/Visual Studio 2013/Projects/Home/Home/steamdata2.txt")

boilpoint <- which(temp_data$y == "boil")    # JUST A MARKER..
temp_data <- filter(temp_data, y != "boil")  # GETTING RID OF THE MARKER ENTRY

# DON'T KNOW WHY BUT I HAD TO DO THIS INTERMEDIATE STEP
# BEFORE I COULD CONVERT FROM FACTOR -> NUMERIC
temp_data$y <- as.character(temp_data$y)        

# CONVERTING TO NUMERIC   
temp_data$y <- as.numeric(temp_data$y)          

# GETTING RID OF BASICALLY THE LAST ENTRY WHICH HAS THE LARGEST VALUE
temp_data <- filter(temp_data, y<max(temp_data$y)) 

# ADD ANOTHER COLUMN WITH THE ROW NUMBER,
# BECAUSE I DON'T KNOW HOW TO ACCESS THIS FOR GGPLOT
temp_data <- transform(temp_data, x = 1:nrow(temp_data))   


n <- nrow(temp_data)         # Num of readings
period <- temp_data[n,1]     # (sec)
RpS <- n / period            # Avg Readings per Second

MIN <- min(temp_data$y)
MAX <- max(temp_data$y)

# DERIVATIVE OF ORIGINAL
deriv <- data.frame(matrix(ncol=2, nrow=n))  

# ADD ANOTHER COLUMN TO ACCESS ROW NUMBERS FOR GGPLOT LATER     
colnames(deriv) <- c("y","x")
deriv <- transform(deriv, x = c(1:n))         

# FILL DERIVATIVE DATAFRAME
deriv[1, 1] <- 0
for(i in 2:n){              
  deriv[i - 1, 1] <- temp_data[i, 1] - temp_data[i - 1, 1]
}
deriv <- filter(deriv, y != 0)

# DID THE SAME FOR SECOND DERIVATIVE
dderiv <- data.frame(matrix(ncol = 2, nrow = nrow(deriv)))
colnames(dderiv) <- c("y", "x")
dderiv <- transform(dderiv, x=rep(0, nrow(deriv)))
dderiv[1, 1] <- 0
for(i in 2:nrow(deriv)) {
  dderiv$y[i - 1] <- (deriv$y[i] - deriv$y[i - 1]) /
                         (deriv$x[i] - deriv$x[i - 1])
  dderiv$x[i - 1] <- deriv$x[i] + (deriv$x[i] - deriv$x[i - 1]) / 2
}
dderiv <- filter(dderiv, y!=0)

# HERE'S WHERE I FACTOR BY VARIOUS MULTIPLES 
deriv <- MIN  + deriv * 3        
dderiv <- MIN  + dderiv * 3      

graph <- ggplot(temp_data, aes(x, y)) + geom_smooth()
graph <- graph + geom_point(data = deriv, color = "yellow")
graph <- graph + geom_smooth(data = deriv, color = "yellow")
graph <- graph + geom_point(data = dderiv, color = "green")
graph <- graph + geom_smooth(data = dderiv, color = "green")
graph <- graph + geom_vline(xintercept = boilpoint, color = "red")
graph <- graph + xlab("Readings (n)") +
    ylab(expression(paste("Temperature  (",degree,"C)")))
graph <- graph + xlim(c(0,n)) + ylim(c(MIN, MAX))
Gregor Thomas
Ben Marconi
  • Two suggestions: plot dataset A on a different facet OR normalise all values to the same scale, e.g. with `scales::rescale()` – baptiste Feb 12 '16 at 03:12
  • You have multiplied your x values as well somewhere by accident. Without the code I can't say where. The first of the higher points is at around (4500, 23.95) on the first graph and (7500, 24.1) on the second. – timcdlucas Feb 12 '16 at 09:18
  • I included the code in my original post, -timcdlucas. I hope someone can find something, I need a fresh set of eyes... I haven't yet tried your suggestions baptiste, I'll go learn about the function and let you know how it goes. – Ben Marconi Feb 13 '16 at 03:49
  • I've edited your code to try to simplify it for the question. I added line breaks so we don't have to scroll horizontally on SO and removed the theme stuff, as it's irrelevant for this question. I also added spaces consistently around binary operators and after commas to help it read a little cleaner. – Gregor Thomas Feb 13 '16 at 06:50

1 Answer

It's hard to check without your raw data, but I'm 99% sure that your main problem is that you're hard-coding the y limits with ylim(c(MIN, MAX)). This is exacerbated by accidentally scaling both variables in your deriv and dderiv data frames, not just y.

I was able to debug the problem when I noticed that your top "scale by 3" graph has a lot more yellow points than your bottom "scale by 5" graph.

The quick fix is to scale only the y values, not the row numbers. That is, replace this

# scales entire data frame: bad!
deriv <- MIN  + deriv * 3        
dderiv <- MIN  + dderiv * 3 

with this:

# only scale y
deriv$y <- MIN  + deriv$y * 3        
dderiv$y <- MIN  + dderiv$y * 3 
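
To see why the original line moves points horizontally, here is a small sketch using a made-up data frame (`toy` and the value of `MIN` are invented for illustration, they are not your data). Arithmetic on a whole data frame is applied to every numeric column, so `x` is transformed along with `y`:

```r
# Toy stand-ins for `deriv` and `MIN`; the values are invented for illustration
toy <- data.frame(y = c(0.1, -0.2, 0.3), x = 1:3)
MIN <- 20

# Arithmetic on a whole data frame touches every column, including x
scaled_all <- MIN + toy * 3
scaled_all$x   # 23 26 29: the row numbers have moved

# Scaling only the y column leaves the horizontal positions alone
scaled_y <- toy
scaled_y$y <- MIN + toy$y * 3
scaled_y$x     # 1 2 3
```

A larger factor pushes the x values further to the right, so more points can end up beyond xlim(c(0, n)) and be dropped. That would fit the "scale by 5" plot showing fewer yellow points.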

I think there is another problem too: even with my correction above, negative values of your derivatives will be excluded. If deriv$y or dderiv$y is ever negative, then MIN + deriv$y * 3 will be less than MIN, and since your y axis begins at MIN it won't be plotted.

So I think the whole fix would be to instead do something like

# keep the original y values around so we can experiment with scaling
# without running *all* the code again
deriv$y_orig <- deriv$y

# multiplicative scale
# set `prop` to the proportion of the vertical plot area
# that you want taken up by the derivative (0.5 is only an example value)
prop <- 0.5
deriv$y <- deriv$y_orig * diff(c(MIN, MAX)) / diff(range(deriv$y_orig)) * prop

# shift into plot range
# set `intercept` to the y value of the lowest point of this line
# (MIN + 1 is only an example value)
intercept <- MIN + 1
deriv$y <- deriv$y - min(deriv$y) + intercept
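
The same scale-then-shift idea can be wrapped in a small function so it can be reused for both derivatives. This is only a sketch: `rescale_into` is a hypothetical helper name (it is not from any package), and the input values are made up. It assumes you want a linear rescaling that preserves the shape of the curve.

```r
# Hypothetical helper (not from any package): linearly map `v` into a band
# that starts at `intercept` and spans `prop` of the range [lo, hi]
rescale_into <- function(v, lo, hi, prop = 0.5, intercept = lo) {
  span <- diff(range(v, na.rm = TRUE))
  if (span == 0) {
    # constant input: nothing to stretch, so just place it at `intercept`
    return(rep(intercept, length(v)))
  }
  scaled <- v * (hi - lo) / span * prop
  scaled - min(scaled, na.rm = TRUE) + intercept
}

# Made-up derivative values, including negatives
d <- c(-0.4, 0.0, 0.6, 0.2)
out <- rescale_into(d, lo = 20, hi = 100, prop = 0.25)
range(out)  # 20 40: the band covers a quarter of the 80-unit axis
```

Because the mapping is linear with a positive slope, maxima and minima stay at the same positions in the vector, and negative inputs no longer fall below the bottom of the axis.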

I normally don't answer questions that aren't reproducible with data because I hate lack of clarity and I hate the inability to test. However, your question was very clear and I'm pretty sure this will work even without testing. Fingers crossed!


A few other, more general comments:

  1. It's good you know that to convert factor to numeric you need to go via character. It's an annoyance, but if you want to understand more here's the r-faq on it.

  2. I'm not sure why you bother with (deriv$x[i] - deriv$x[i - 1]) in your for loop. Since you define x to be 1, 2, 3, ... the difference is always 1. I'm more confused by why you divide by 2 in the second derivative.

  3. Your for loop can probably be replaced by the diff() function. (See below.)

  4. You seem to have just gotten your foot in the dplyr door, so I used base functions in my recommendation. Keep working with dplyr, I think you'll like it. The big dplyr function you're not using is mutate. It works like base::transform for adding new columns.

  5. I dislike that you've created all these different data frames; it clutters things up. I think your code could be simplified to something like this:

    all_data = filter(temp_data, y != "boil") %>%
        mutate(y = as.numeric(as.character(y))) %>%
        filter(y < max(y)) %>%
        mutate(
            x = 1:n(),
            deriv = c(NA, diff(y)) / c(NA, diff(x)),
            dderiv = c(NA, diff(deriv)) / 2
        )
    

Rather than having separate data frames for the original data, first derivative and second derivative, this puts them all in the same data frame.
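
As a quick check that diff() does the same job as the first-derivative loop, here is a minimal sketch on made-up readings (the values of `y` are invented for illustration):

```r
# Made-up temperature readings, for illustration only
y <- c(20.0, 20.5, 21.5, 21.5, 23.0)
n <- length(y)

# Loop version: difference between consecutive readings
loop_diff <- numeric(n - 1)
for (i in 2:n) {
  loop_diff[i - 1] <- y[i] - y[i - 1]
}

# Vectorised version
vec_diff <- diff(y)

all.equal(loop_diff, vec_diff)  # TRUE

# Padding with NA keeps the result the same length as y, so it can sit
# in the same data frame as the original readings
c(NA, vec_diff)
```

Note that the code in the question also drops zero differences afterwards with filter(deriv, y != 0); diff() on its own does not do that.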

  6. The big benefit of having things in one data frame is that you could then "gather" it into a nice, long (rather than wide) tidy format and simplify your plotting call:

    library(tidyr)
    # `function` is a reserved word in R, so the key column is named `func`
    long_data = gather(all_data, key = "func", value = "value", y, deriv, dderiv)
    

Then your ggplot call would look more like this:

graph <- ggplot(long_data, aes(x, value, color = func)) + 
   geom_smooth() +
   geom_point() +
   geom_vline(xintercept = boilpoint, color = "red") +
   scale_color_manual(values = c("green", "yellow", "blue")) +
   xlab("Readings (n)") +
   ylab(expression(paste("Temperature  (",degree,"C)"))) +
   xlim(c(0,n)) + ylim(c(MIN, MAX))

With data in long format, you'd have a column of your data (I've named it "func") that maps to color, so you don't have to add all the layers one at a time, and you get a nicely generated legend!

Gregor Thomas