R ggplot or other: How can I plot a line of best fit for grouped AND paired data on the same plot?

Question

I would like to plot 5 lines of best fit summarising 5 groups of paired points (i.e. summarising the lines between the two points)

I have a data set structured like this (readings of yield and rainfall for 500 farms grouped into 5 production systems for 2 years. Each farm occurs twice in the data set 2015 and 2017):

FARM ID (cocoa$XACIB) // YEAR //  RAINFALL (cocoa$X15_clim) //  YIELD (cocoa$TY_15.1.16_HA) // SYSTEM (cocoa$Production_system)

1   // 2015 //  1200    //  500    //  A

1   // 2017 //  1400    // 600     //  A

2   // 2015 //  900    // 400      //  B

Reproducible data set at bottom...

I want to create a scatter plot of RAINFALL (x axis) vs YIELD (y axis) with points grouped by SYSTEM and a line of best fit for each of the 5 SYSTEMS based on the paired change in YIELD for each ID (Farm).

Currently I have got stuck with, generating a graph of paired data but no trendline:

ggplot(cocoa, aes(x=cocoa$X15_clim, y=cocoa$TY_15.1.16_HA,  color=cocoa$Production_system)) +
  geom_point() +
  geom_line(group=cocoa$XACIB)

paired data with lines but no trendlines

or stuck with a graph of grouped data with trendlines for production systems but not baed on the pairs based on all the data for each production system:

ggplot(cocoa, aes(x=cocoa$X15_clim, y=cocoa$TY_15.1.16_HA,  color=cocoa$Production_system)) +
  geom_point() +
  geom_smooth(method=lm, se =FALSE)

trend lines for groups but not pairs...

Please could someone help me find a way to produce a trend line for each group based on the paired data?

MANY THANKS!

dput(cocoa)

structure(list(Production_system = structure(c(4L, 4L, 4L, 4L, 
1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 1L, 3L, 3L, 1L, 1L, 
3L, 3L, 5L, 5L, 5L, 5L, 4L, 4L, 3L, 3L, 4L, 4L, 3L, 3L, 2L, 2L, 
5L, 5L, 3L, 3L, 5L, 5L, 5L, 5L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L, 5L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L
), .Label = c("A_CONV1", "B_UTZ", "C_ORG", "D_CONV2", "E_RA"), class = "factor"), 
    XACIB = c(22000, 22000, 30000, 30000, 35701, 35701, 42016, 
    42016, 44880, 44880, 46800, 46800, 48000, 48000, 56672, 56672, 
    62119, 62119, 65318.4, 65318.4, 76916, 76916, 79040, 79040, 
    82194, 82194, 85800, 85800, 90576, 90576, 90720, 90720, 92252.8, 
    92252.8, 93728, 93728, 94080, 94080, 105473, 105473, 105600, 
    105600, 108160, 108160, 108820, 108820, 111385.2, 111385.2, 
    112060, 112060, 113568, 113568, 116280, 116280, 119145, 119145, 
    120392, 120392, 128625, 128625, 131040, 131040, 135000, 135000, 
    138256, 138256, 143208, 143208), TY_15.1.16_HA = c(459.6153, 
    306.4102, NA, 905.3028636, NA, 229.80765, 229.80765, 612.8204, 
    NA, 15.95886458, 306.4102, 204.2734667, 344.711475, 689.42295, 
    229.80765, 383.01275, 306.4102, 229.80765, 102.1367333, 153.2051, 
    183.84612, 344.711475, 61.28204, 91.92306, 901.2064706, 1201.608627, 
    919.2306, 766.0255, 536.21785, 153.2051, 265.5555067, 306.4102, 
    NA, 229.80765, 95.7531875, 287.2595625, 510.6836667, 638.3545833, 
    127.6709167, 306.4102, 14.36297813, 153.2051, 612.8204, 1021.367333, 
    383.01275, 510.6836667, 87.54577143, 131.3186571, 200.3451308, 
    223.9151462, 107.24357, 199.16663, 868.1622333, 800.0710778, 
    465.289563, 635.5174519, 932.3624657, 875.4577143, 357.4785667, 
    510.6836667, 255.3418333, 153.2051, 218.8644286, 21.88644286, 
    103.8922084, 146.5023769, 282.8401846, 306.4102), X15_clim = c(1148.415157, 
    1413.089005, 1148.415157, 1413.089005, 1214.720448, 1257.496108, 
    1136.028355, 1383.640551, 1270.817029, 1297.849306, 1270.817029, 
    1297.849306, 1136.028355, 1383.640551, 1244.262682, 1281.788645, 
    1214.136308, 1279.60737, 1244.262682, 1281.788645, 1270.817029, 
    1297.849306, 1118.994984, 1350.821456, 1125.50238, 1370.241065, 
    1136.028355, 1383.640551, 1214.136308, 1279.60737, 1126.707093, 
    1391.304566, 1245.721294, 1307.936407, 1240.241637, 1267.070353, 
    1125.50238, 1370.241065, 1270.817029, 1297.849306, 1094.358686, 
    1346.225715, 1118.994984, 1350.821456, 1214.720448, 1257.496108, 
    1246.361514, 1266.872479, 1245.721294, 1307.936407, 1136.028355, 
    1383.640551, 1118.994984, 1350.821456, 1136.028355, 1383.640551, 
    1136.028355, 1383.640551, 1214.136308, 1279.60737, 1139.349273, 
    1397.517303, 1148.415157, 1413.089005, 1214.720448, 1257.496108, 
    1214.720448, 1257.496108)), class = "data.frame", row.names = c(NA, 
-68L))

Can you review the code in your question ? Currently, the code display does not reflect each graph obtained (see x and y axis title) — dc37, Apr 01 '20 at 19:55
Also, please provide a reproducible example of your dataset by following this guide: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example it will help people to assist you — dc37, Apr 01 '20 at 19:56
Ah sorry i changed the code to make it easier to read with the text (first time on here!). Actual code: ggplot(cocoa, aes(x=cocoa$X15_clim, y=cocoa$TY_15.1.16_HA, color=cocoa$Production_system)) + geom_point() + geom_smooth(method=lm, se =FALSE) — Billy Faucon, Apr 01 '20 at 19:58
Please edit your question to reflect those changes. It will be easier for people to follow your question without to dig informations into comments section — dc37, Apr 01 '20 at 20:00
First: Your example dataset is not well chosen. It contains only data for one production system and only one obs. per farm. Second: It's an easy task to make a best fit line for each farm. However, if I got you right, you want a line of best fit for each farm by production system, i.e. farms switch production systems? — stefan, Apr 02 '20 at 06:48
Hi @stefan thanks for your response. I have changed the example data set as suggested (apologies first time on here). I would like to make a best fit line per production system - but taking into account that each point (rainfall vs yield) is paired (i.e. there are two points per farm (two years) ) so i want to generate a line of best fit that summarises the yield response to rainfall for each group (production system). Farms do not switch production systems. Many thanks for any help you can give. — Billy Faucon, Apr 02 '20 at 07:44

score 0 · Answer 1 · answered Apr 02 '20 at 07:55

0

Hope I got you right. The trick is to add group = XACIB to the set of aesthetics. BTW: As in your first approach you can also switch to geom_line instead of geom_smooth. Same plot as you have only 2 datapoints per farm. Try this:

library(ggplot2)


ggplot(cocoa, aes(x = X15_clim, y = TY_15.1.16_HA, color = Production_system, group = XACIB)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 4 rows containing non-finite values (stat_smooth).
#> Warning: Removed 4 rows containing missing values (geom_point).

^{Created on 2020-04-02 by the reprex package (v0.3.0)}

answered Apr 02 '20 at 07:55

stefan

90,330
6
25
51

thanks for this. Actually I am aiming to produce only 5 lines of best fit. 1 per Production system. I expect the plot to look like the second picture I posted (however the problem with that plot is that the lines of best fit are constructed treating each point as individual - where as they are actually paired). This is the problem I want to overcome ... – Billy Faucon Apr 02 '20 at 08:03
Hm. In that case: I have to think about that. (; Sounds like you want do a panel regression with fixed effects, i.e. constant slopes per production system but varying intercepts by farm. Don't know whether this can be achieved via geom_smooth. Probably you have to compute the regression lines manually before plotting. – stefan Apr 02 '20 at 08:24

R ggplot or other: How can I plot a line of best fit for grouped AND paired data on the same plot?

1 Answers1