How do you compare similarities between variables in an R data frame, based on two categorical variables and one numeric variables

Question

I have a dataframe with three variables of interest: LGA(Location), Offence Category and Total (numeric) What I am hoping to do, is compare the distance/similarity between each LGA, based on the Total value, in order to create a heat map or similar structure. Is this possible? And if so, what would the process be? Here is a snippet of the data frame:

This is currently too broad, and not [reproducible](https://stackoverflow.com/q/5963269/5325862). Take a look again at [ask] and the [mcve] guidance to see how you can narrow down the question — camille, Jun 08 '21 at 02:45
Are you looking to build a [chloropleth](https://blog.stata.com/wp-content/uploads/2020/04/covid19_map2.png)? And if so is there lat/lon data? What exactly is LGA in terms of counties or cities? — Kent Orr, Jun 08 '21 at 02:46

jared_mamrot · Answer 1 · 2021-06-08T11:15:21.953

I don't really understand your question, but here is an example of a heatmap and a clustered heatmap for 'similar' data:

# Load libraries
library(tidyverse)
library(readxl)
library(httr)

# Find some data
url1 <- "https://www.bocsar.nsw.gov.au/Documents/lga/NewSouthWales.xlsx"

# Get the data and remove missing data points (NA's)
GET(url1, write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(path = tf, 2L, skip = 5) %>% 
  na.omit()

df2 <- df %>% 
  # format the data to "long format" for plotting
  pivot_longer(cols = -c(`Premises type`)) %>%
  # Change "Premises type" and "name" to factors
  mutate(`Premises type` = factor(
    `Premises type`, levels = unique(`Premises type`))
    ) %>%
  mutate(name = factor(
    name, levels = unique(name))
    ) %>%
  # Remove the "Total" counts
  filter(`Premises type` != "Total") 

# Define colours for text (white for dark fill, black for light fill)
hcl <- farver::decode_colour(viridisLite::inferno(length(df2$value)), "rgb", "hcl")
label_col <- ifelse(hcl[, "l"] > 50, "black", "white")

# Plot the data (log scale for fill)
ggplot(df2, aes(y = fct_rev(`Premises type`),
                x = name, fill = log(value))) +
  geom_tile() +
  geom_text(aes(label = value, color = factor(value)),
            show.legend = FALSE, size = 2.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1.05),
        axis.title = element_blank()) +
  scale_color_manual(values = label_col) +
  scale_fill_viridis_c(option = "inferno", na.value = "black")

And a clustered heatmap (similar Premises Type / Crime types cluster together):

# Load the raw data and format for pheatmap (expects a matrix)
dm <- read_excel(path = tf, 2L, skip = 5) %>% 
  na.omit() %>%
  column_to_rownames(var = "Premises type")

# Plot the data
pheatmap::pheatmap(as.matrix(dm), scale = "row")

Edit

I haven't used it before, so I don't know if the output is correct, but based on this SO post you can use cluster::daisy() to get the gower dissimilarity for "Premises Type" then plot using pheatmap, e.g.

library(cluster)
pheatmap::pheatmap(as.matrix(daisy(dm)))

Edit 2

You only need two variables for this heatmap (i.e. "Local government Area" (Character) and "Total" (Numeric) should be fine):

# Load libraries
library(tidyverse)
library(readxl)
library(httr)
library(cluster)
library(pheatmap)

# Find some data
url1 <- "https://www.bocsar.nsw.gov.au/Documents/lga/NewSouthWales.xlsx"

# Get the data and remove missing data points (NA's)
GET(url1, write_disk(tf <- tempfile(fileext = ".xlsx")))
df <- read_excel(path = tf, 2L, skip = 5) %>% 
  na.omit()

# Select two variables, then set the Premises type as the rownames
df3 <- df %>% 
  select(`Premises type`, Robbery) %>% 
  column_to_rownames(var = "Premises type")

# (in your case, use "column_to_rownames(`Local government Area`)"

# Then plot the heatmap
pheatmap(daisy(as.matrix(df3)),
         labels_row = rownames(df3),
         labels_col = rownames(df3))

Thanks Jared, much appreciated. I was planning to use gower distance for the similarity — Luke Hansen, Jun 08 '21 at 03:59
I've updated my answer, but I have no idea if the answer is "correct" or not — jared_mamrot, Jun 08 '21 at 04:23
That is exactly what I am after. I am just not sure how to get there, I get an error saying: error in evaluating the argument 'x' in selecting a method for function 'as.matrix': invalid type character for column numbers 3, 4invalid type character for column numbers 3, 4. Basically, I have a data frame now with two variables: Local government Area (Character) and Total (Numeric). How would I depict these variables in a heatmap like the one you have created above with pheatmap? — Luke Hansen, Jun 08 '21 at 10:30
You only need two variables for this style heatmap - see my 2nd edit above — jared_mamrot, Jun 08 '21 at 11:15

How do you compare similarities between variables in an R data frame, based on two categorical variables and one numeric variables

1 Answers1

Edit

Edit 2