data dropping during join

Question

I am working on a risk matrix, built with dplyr and ggplot2, that shows a count of issues identified through qualitative data analysis. However, when I use dplyr::left_join() to join my count data to a grid specifying positions and colors, a few of the counts are dropped.

Can someone explain to me why the rows in df_n with counts higher than 1 or 2 seem to get dropped when merged into df_plot? Thanks!

library(tidyverse)

df_n <- structure(list(frequency = c(0.3, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7, 
0.7, 0.7, 0.7, 0.9, 0.9, 0.9), criticality = c(30, 70, 90, 10, 
30, 50, 30, 50, 70, 90, 50, 70, 90), inefficiency = c(9, 21, 
27, 5, 15, 25, 21, 35, 49, 63, 45, 63, 81), color = c("green", 
"green", "yellow", "green", "green", "yellow", "green", "yellow", 
"red", "red", "yellow", "red", "red"), n = c(1L, 1L, 1L, 1L, 
1L, 1L, 4L, 2L, 5L, 2L, 1L, 2L, 1L)), row.names = c(NA, -13L), class = c("tbl_df", 
"tbl", "data.frame"))

df_color <- structure(list(frequency = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 
0.3, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7, 0.7, 
0.9, 0.9, 0.9, 0.9, 0.9), criticality = c(10, 30, 50, 70, 90, 
10, 30, 50, 70, 90, 10, 30, 50, 70, 90, 10, 30, 50, 70, 90, 10, 
30, 50, 70, 90), inefficiency = c(1, 3, 5, 7, 9, 3, 9, 15, 21, 
27, 5, 15, 25, 35, 45, 7, 21, 35, 49, 63, 9, 27, 45, 63, 81), 
    color = c("green", "green", "green", "green", "green", "green", 
    "green", "green", "green", "yellow", "green", "green", "yellow", 
    "yellow", "yellow", "green", "green", "yellow", "red", "red", 
    "green", "yellow", "yellow", "red", "red")), row.names = c(NA, 
-25L), class = c("tbl_df", "tbl", "data.frame"))

df_plot <- df_color %>%
  dplyr::left_join(df_n, by = c("frequency", "criticality", "inefficiency", "color"))

ggplot2::ggplot(data = df_plot, ggplot2::aes(x = frequency, y = criticality, fill = color)) +
  ggplot2::geom_tile(color = "white", lwd = 1.5, linetype = 1) +
  ggplot2::scale_fill_identity() +
  ggplot2::geom_text(aes(label = n, fontface = "bold")) +
  ggplot2::theme_classic()

You missed one `ggplot2::` before the `aes()` in the `geom_text` layer (not related to your problem, just an observation). — Gregor Thomas, Aug 23 '22 at 13:38
There are observations in df_color that have no match in df_n, hence no values. For example, there are no values in df_n for frequency 0.1. But in your df_plot the values 4 and 5 do appear with frequency 0.7. Not sure what you are missing, but everything comes down to not having matches in the keys when joining. — phiver, Aug 23 '22 at 13:42
When I run your code, I see a 5x5 grid of tiles, of which 13 have labels. Since `df_n` has 13 rows, this seems correct--nothing is dropped. — Gregor Thomas, Aug 23 '22 at 13:42
Though I will say that including a non-integer numeric column (`frequency`) in your join is a little risky [due to floating point precision issues](https://stackoverflow.com/q/9508518/903061). — Gregor Thomas, Aug 23 '22 at 13:44
I wonder if you perhaps wanted `df_n %>% left_join(df_color, ..etc..)`, so that the end result is 13 rows, all of which are matched. Order matters with left join. — Allan Cameron, Aug 23 '22 at 13:45
hi @AllanCameron, since I need to specify the background color for the geom_tile() even if a cell has no entries, I want to left_join the counts onto the full list of possible frequency & criticality values — M. Wood, Aug 23 '22 at 16:03
@M.Wood in that case, all the data from df_n appears to be in df_plot. Can you explain what has been dropped? — Allan Cameron, Aug 23 '22 at 16:08
@GregorThomas thanks for that reference. I think that precision issue is just what the problem was. I removed `frequency` from the `by = ...` in the join and things work just fine. When I saved these objects off using `dput()` and they were read in on another system, the `frequency` values were read in more precisely. — M. Wood, Aug 23 '22 at 16:28
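
The floating-point risk raised in the comments above can be reproduced on a small scale. The following is only a sketch: `grid` and `counts` are hypothetical stand-ins for `df_color` and `df_n`, and `0.1 + 0.2` is used as an assumed source of rounding error (the thread does not say how the original `frequency` values were computed). It shows one way to make such a mismatch visible and to list the rows that fail to match.

```r
library(dplyr)

# Hypothetical stand-ins for df_color (grid) and df_n (counts).
grid   <- tibble(frequency = c(0.1, 0.3, 0.5), criticality = 30)
counts <- tibble(frequency = 0.1 + 0.2, criticality = 30, n = 4L)

# Default printing shows 0.3 for both values; 17 significant digits
# reveal that they are different doubles.
sprintf("%.17g", counts$frequency)
sprintf("%.17g", 0.3)

# Exact comparison fails, tolerance-based comparison succeeds.
exact_equal <- counts$frequency == 0.3
near_equal  <- isTRUE(all.equal(counts$frequency, 0.3))

# anti_join() keeps the rows of counts that have no partner in grid,
# which is a quick way to see what a left_join() would leave unmatched.
unmatched <- anti_join(counts, grid, by = c("frequency", "criticality"))
unmatched
```

If `unmatched` has rows even though the key values look identical when printed, a precision difference in a numeric key is a likely suspect.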

score 0 · Accepted Answer · answered Aug 23 '22 at 16:33

Inspired by @Gregor Thomas's comment on the question, I dropped frequency from the join and it worked fine. Since frequency * criticality = inefficiency in this case, the frequency values were already encoded in the remaining join keys.
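
A minimal sketch of that fix, using hypothetical miniature tables rather than the real `df_color` and `df_n`. The value `0.1 + 0.2` is an assumption chosen to mimic a `frequency` that prints as 0.3 but is stored slightly differently. Option 2 (an integer key built by rounding, with the made-up name `freq_key`) is not from the thread; it is included only as an alternative for cases where the fractional column cannot simply be dropped.

```r
library(dplyr)

# Hypothetical miniature versions of df_color (grid) and df_n (counts).
grid <- tibble(
  frequency    = c(0.1, 0.3, 0.5),
  criticality  = c(30, 30, 30),
  inefficiency = c(3, 9, 15)
)
counts <- tibble(
  frequency    = 0.1 + 0.2,  # prints as 0.3 but is not equal to the literal 0.3
  criticality  = 30,
  inefficiency = 9,
  n            = 4L
)

# Joining on the fractional column: the count fails to match and n stays NA.
by_float <- left_join(grid, counts,
                      by = c("frequency", "criticality", "inefficiency"))

# Option 1 (the approach described above): leave frequency out of the keys.
by_whole <- left_join(grid, select(counts, -frequency),
                      by = c("criticality", "inefficiency"))

# Option 2 (an assumption, not from the thread): join on an integer key
# built by rounding, so tiny representation differences collapse.
add_key <- function(df) mutate(df, freq_key = as.integer(round(frequency * 10)))
by_key <- left_join(add_key(grid),
                    select(add_key(counts), freq_key, criticality, n),
                    by = c("freq_key", "criticality"))
```

Option 1 only works because the remaining keys still identify each grid cell uniquely, which holds here since inefficiency is determined by frequency and criticality.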

When I saved both data frames with dput() and the good folks here read them back in, the precision error that Gregor cited did not survive the round trip. This is why you all saw the correct join when merging those objects.
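
One plausible mechanism for this, sketched below: with its default settings `dput()` writes doubles with 15 significant digits, so a value that differs from 0.3 only in the last bits is written out as `0.3` and comes back as exactly 0.3. The value `0.1 + 0.2` is an assumed example, not taken from the original data. Passing `control = "digits17"` should preserve the stored value if an exact reproduction is needed.

```r
x <- 0.1 + 0.2  # stands in for a frequency value carrying rounding error

# Text produced by dput() with default settings, then read back in.
txt_default <- capture.output(dput(x))
x_roundtrip <- eval(parse(text = txt_default))

# Same round trip, but asking dput() for 17 significant digits.
txt_digits17 <- capture.output(dput(x, control = "digits17"))
x_exact      <- eval(parse(text = txt_digits17))

x == 0.3               # FALSE: the original value is not the literal 0.3
x_roundtrip == 0.3     # TRUE: the default round trip cleaned it up
identical(x_exact, x)  # TRUE: digits17 keeps the stored value
```

So a reproducible example shared via plain `dput()` can behave better than the data it was made from.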

Thanks again all for your help!
