Newbie attempting linear mixed effects model in R studio - TOTAL FAIL

Question

After searching over an hour (this forum, Youtube, class notes, google) I've found no help for my question. I'm a complete newb who knows nothing about R or stats.

I'm attempting to create a linear mixed effects model in R. I'm measuring leaf width across three different locations (Jacksonville FL, Augusta GA, & Atlanta GA), and within those three locations there is a high-nitrogen and low-nitrogen plot. I have 150 leaf measurements from 50 trees.

My limited understanding tells me that the leaf width is the continuous response variable, and city and plot are the discrete explanatory variables. The random effect would be the individual trees, since the leaf width within a single tree is non-independent.

I've used "nlme" to make a model:

leaf.width.model <- lme(width ~ city*plot, (1|tree.id), data=leaf)

I then ran an ANOVA test, and it suggested there's something going on with city and the interaction between city and plot. This is where I'm stuck. I want to make a plot that has lines for all three cities, but I haven't a clue how to do that. When I try to use the plot function, I just get a boxplot.

I've literally tried for hours and am more lost and confused than before.

1) How can I make this graph?

2) What other tests should I do to analyze and/or visualize this data?

I am forever grateful for any help at all. I really want to learn R and stats very badly, but I'm getting discouraged.

Thank you,

Rich

P.S Here is the output of the dput function:

> dput(tree) structure(list(tree.id = structure(c(24L, 24L, 32L, 25L, 25L, 24L, 24L, 32L, 25L, 25L, 43L, 45L, 45L, 43L, 23L, 23L, 45L, 45L, 23L, 23L, 41L, 41L, 38L, 11L, 11L, 38L, 41L, 41L, 11L, 11L, 14L, 14L, 29L, 13L, 13L, 14L, 14L, 29L, 13L, 13L, 4L, 4L, 1L, 1L, 20L, 1L, 1L, 20L, 6L, 8L, 8L, 5L, 5L, 6L, 4L, 4L, 8L, 8L, 5L, 5L, 9L, 9L, 10L, 10L, 12L, 12L, 13L, 13L, 22L, 22L, 23L, 23L, 24L, 24L, 25L, 25L, 25L, 25L, 40L, 40L, 41L, 41L, 38L, 38L, 39L, 39L, 14L, 14L, 14L, 15L, 15L, 28L, 28L, 29L, 29L, 35L, 35L, 36L, 36L, 37L, 37L, 42L, 42L, 43L, 43L, 44L, 44L, 45L, 45L, 46L, 46L, 47L, 47L, 2L, 1L, 3L, 3L, 4L, 4L, 7L, 11L, 11L, 16L, 16L, 20L, 20L, 21L, 21L, 17L, 17L, 18L, 18L, 19L, 19L, 26L, 26L, 27L, 27L, 30L, 30L, 31L, 31L, 32L, 32L, 33L, 33L, 34L, 34L, 48L), .Label = c("Tree_112", "Tree_112 ", "Tree_115", "Tree_130", "Tree_137", "Tree_139", "Tree_140", "Tree_141", "Tree_153", "Tree_154", "Tree_156", "Tree_159", "Tree_166", "Tree_169", "Tree_171", "Tree_180", "Tree_182", "Tree_184", "Tree_185", "Tree_202", "Tree_213", "Tree_218", "Tree_222", "Tree_227", "Tree_239", "Tree_242", "Tree_246", "Tree_247", "Tree_252", "Tree_260", "Tree_267", "Tree_269", "Tree_271", "Tree_272", "Tree_291", "Tree_293", "Tree_298", "Tree_327", "Tree_329", "Tree_336", "Tree_350", "Tree_401", "Tree_403", "Tree_405", "Tree_407", "Tree_409", "Tree_420", "Tree_851"), class = "factor"), city = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Atlanta", "Augusta", "Jacksonville"), class = "factor"),     plot = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("High-N", "Low-N"), class = "factor"), width = c(0.66, 0.716, 0.682, 0.645, 0.645, 0.696, 0.733,
0.707, 0.668, 0.686, 0.617, 0.733, 0.73, 0.615, 0.669, 0.746, 0.687, 0.682, 0.76, 0.713, 0.651, 0.664, 0.679, 0.729, 0.756, 
0.669, 0.647, 0.713, 0.767, 0.685, 0.69, 0.731, 0.781, 0.729, 
0.725, 0.739, 0.769, 0.791, 0.676, 0.688, 0.719, 0.753, 0.748, 
0.791, 0.785, 0.78, 0.723, 0.756, 0.664, 0.645, 0.653, 0.615, 
0.591, 0.642, 0.693, 0.716, 0.694, 0.676, 0.662, 0.629, 0.665, 
0.748, 0.726, 0.693, 0.715, 0.714, 0.764, 0.732, 0.61, 0.721, 
0.703, 0.713, 0.746, 0.752, 0.662, 0.733, 0.707, 0.674, 0.734, 
0.79, 0.732, 0.794, 0.703, 0.712, 0.737, 0.731, 0.747, 0.746, 
0.787, 0.709, 0.716, 0.764, 0.77, 0.764, 0.802, 0.663, 0.777, 
0.642, 0.779, 0.81, 0.724, 0.645, 0.68, 0.637, 0.695, 0.768, 
0.761, 0.7, 0.759, 0.726, 0.696, 0.794, 0.774, 0.799, 0.747, 
0.606, 0.691, 0.733, 0.707, 0.698, 0.706, 0.72, 0.694, 0.697, 
0.737, 0.716, 0.73, 0.706, 0.667, 0.734, 0.528, 0.695, 0.684, 
0.763, 0.733, 0.809, 0.6, 0.676, 0.718, 0.759, 0.609, 0.665, 
0.667, 0.647, 0.701, 0.663, 0.688, 0.693, 0.899)), .Names = c("tree.id", "city", "plot", "width"), class = "data.frame", row.names = c(NA, -149L))

Thank you all so much for your comments, I sincerely appreciate everyone's help!

Hey Richard, welcome to SO! Can you provide us a snapshot of your data using, for example, `dput`? That'll make it easier to help (see this for more: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, you don't have to be self-deprecating in order to get help here. In fact, all we care about is the content of your question, not how it originated (unless relevant) and not how good you are in either R or stats. — tifu, Mar 22 '18 at 13:02
I'm not sure a line plot makes sense. Putting the continuous response on the y-axis is clear, but you don't have anything continuous to put on the x-axis. I'm also not clear about your goal for the plot: do you want to plot the raw data or the model prediction (or both)? If the raw data, a boxplot might make sense, perhaps with different colors of boxes for each plot. — Gregor Thomas, Mar 22 '18 at 13:10

score 1 · Accepted Answer · answered Mar 22 '18 at 14:35

As suggested in comments, a line plot might not make sense for your data, as you are studying how width varies in discrete categories (in separate cities and separate plots). Boxplots would make sense as you can make them for each of the interactions of city and plot. To give you a sense of what you can do I generated some fake data and made an example of the sort of plot that might be helpful to you:

# fake data
leaf <- data.frame(tree.id = rep(1:50, each = 3), 
                   city = rep(c("Jackson", "Augusta", "Atlanta"), each = 50),
                   plot = rep(1:6, each = 25))
# I'll make the average of width different for each plot
leaf$width <- rnorm(nrow(leaf), leaf$plot, 1)

# plotting the data
library(ggplot2) # this is a great library for plotting in R

ggplot(leaf, aes(x = factor(plot), y = width, color = factor(plot))) + 
  facet_grid(~city, scales = 'free_x') + # This creates a subplot for each city
  geom_boxplot() + 
  geom_point(position = "jitter") + 
  theme_bw()

In this plot I added the points (the leaf widths for each individual tree) but I 'jittered' them, meaning perturbing their position slightly so that they do not pile up on top of each other and are all visible. You could remove this if you liked.

Exploratory data analysis should be fun! And I think visualization is a good place to start when beginning in statistics. Hopefully this will prove helpful to you.

gfgm, thank you so much for you help! Is there a way to represent the mean of each group with something like a triangle or star? I searched "how to add mean to ggplot," but couldn't find anything that would work. Thank you again. — Richard Gourderton, Mar 30 '18 at 14:06
you could put a big black cross at the mean of each group by appending `+ stat_summary(fun.y = "mean", geom = "point", color = 'black', size=10, shape = "+")` to the end of the plotting function call. If the answer is useful upvote it! — gfgm, Mar 30 '18 at 14:17
One last thing! How do I go about switching the order of the box plots to make it go from Jacksonville to Augusta to Atlanta? I tried using the "c" concatenate function, but it either just deletes all the cities, or creates an separate variable object called "city" that has only 3 things in it (just the cities) Thank you so much! I upvoted and marked as the best answer — Richard Gourderton, Apr 05 '18 at 16:16

Rajesh Talluri · Answer 2 · 2018-04-03T14:34:21.880

leaf.width.model <- lme(width ~ city*plot, (1|tree.id), data=leaf)

In this model if you want to plot something, you are probably trying to answer: How much is the average leaf width for all trees in each city for each type of plot.

To show this information in a figure, you need to plot width on y axis plot plot(high and low nitrogen) on x axis and group the data by city. Then you will get the 3 lines you are taking about. However, you need to get the average width in each group as you only want to show city variation.

To get this plot from raw data: (Using fake data provided by gfgm)

set.seed(100)
leaf <- data.frame(tree.id = rep(1:50, each = 3), 
                   city = rep(c("Jackson", "Augusta", "Atlanta"), each = 50),
                   plot = rep(c(1, 0), each = 25))

# I'll make the average of width different for each plot
leaf$width <- rnorm(nrow(leaf), leaf$plot, 1)

library(plotly)
library(tidyverse)
leaf %>% 
  group_by(city,plot) %>% 
  summarise(avwidth = mean(width, na.rm=T),
            avsd = 1.96*sd(width, na.rm=T)/sqrt(25)) %>%
  plot_ly(x = ~plot, y = ~avwidth, color= ~city, 
          type="scatter", mode="markers+lines", 
          error_y = ~list(array=avsd)
          )

I just tried it and it works awesome! So if I understand, the points represent the average width for that group (e.g., Atlanta, High-N). Is there a way to add an error bar to the points? Thank you so much Rajesh Talluri! — Richard Gourderton, Mar 30 '18 at 13:58
I edited the answer to add standard errors. Here I am just using the normal symmetric standard errors mean+-1.96*sd/sqrt(n) — Rajesh Talluri, Apr 03 '18 at 14:35
Thank you so much! I upvoted your answer (though it says it won't show because I'm new). — Richard Gourderton, Apr 05 '18 at 00:15

Newbie attempting linear mixed effects model in R studio - TOTAL FAIL

2 Answers2