I'm analysing real-estate sales for some North American cities and am using k-means clustering on the data. I have seven clusters, and for each observation I have the latitude, longitude, zipcode, and cluster_id. I'd like to plot this on a map to better visualize the clusters, but I'm not sure what such a plot is called: choropleth? Polygon?

Most of the examples use geoJSON files, but I only have a data.frame object from my k-means clustering.

Actual data:

https://www.kaggle.com/threnjen/portland-housing-prices-sales-jul-2020-jul-2021

Sample data:

> dput(dt[runif(n = 10,min = 1,max = 25000)])
structure(list(id = c(23126L, 15434L, 5035L, 19573L, NA, 24486L, 
NA, 14507L, 3533L, 20192L), zipcode = c(97224L, 97211L, 97221L, 
97027L, NA, 97078L, NA, 97215L, 97124L, 97045L), latitude = c(45.40525436, 
45.55965805, 45.4983139, 45.39398956, NA, 45.47454071, NA, 45.50736618, 
45.52812958, 45.34381485), longitude = c(-122.7599182, -122.6500015, 
-122.7288742, -122.591217, NA, -122.8898392, NA, -122.6084061, 
-122.91745, -122.5948334), lastSoldPrice = c(469900L, 599000L, 
2280000L, 555000L, NA, 370000L, NA, 605000L, 474900L, 300000L
), lotSize = c(5227L, 4791L, 64904L, 9147L, NA, 2178L, NA, 4356L, 
2613L, 6969L), livingArea = c(1832L, 2935L, 5785L, 2812L, NA, 
1667L, NA, 2862L, 1844L, 742L), cluster_id = c(7, 7, 2, 7, NA, 
4, NA, 7, 7, 4)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7faa8000fee0>)

I've followed the example at https://gist.github.com/josecarlosgonz/8565908 to try to create a geoJSON file so that I can plot this data, but without success.

I'm not using markers because I have ~25,000 observations - it would be difficult to plot them all and the file would take forever to load.

EDIT:

Observations by zipcode:

> dput(dat[, .N, by = .(`address/zipcode`)][(order(`address/zipcode`))])
structure(list(`address/zipcode` = c(7123L, 97003L, 97004L, 97005L, 
97006L, 97007L, 97008L, 97009L, 97015L, 97019L, 97023L, 97024L, 
97027L, 97030L, 97034L, 97035L, 97038L, 97045L, 97056L, 97060L, 
97062L, 97068L, 97070L, 97078L, 97080L, 97086L, 97089L, 97113L, 
97123L, 97124L, 97132L, 97140L, 97201L, 97202L, 97203L, 97204L, 
97205L, 97206L, 97209L, 97210L, 97211L, 97212L, 97213L, 97214L, 
97215L, 97216L, 97217L, 97218L, 97219L, 97220L, 97221L, 97222L, 
97223L, 97224L, 97225L, 97227L, 97229L, 97230L, 97231L, 97232L, 
97233L, 97236L, 97239L, 97266L, 97267L), N = c(1L, 352L, 9L, 
252L, 421L, 1077L, 357L, 1L, 31L, 2L, 4L, 159L, 239L, 525L, 640L, 
548L, 1L, 1064L, 5L, 353L, 471L, 736L, 6L, 403L, 866L, 913L, 
8L, 5L, 1113L, 776L, 3L, 543L, 219L, 684L, 463L, 1L, 57L, 809L, 
189L, 216L, 688L, 510L, 504L, 330L, 318L, 177L, 734L, 195L, 832L, 
305L, 276L, 589L, 688L, 716L, 286L, 83L, 1307L, 475L, 77L, 150L, 
382L, 444L, 290L, 423L, 430L)), row.names = c(NA, -65L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7f904781a6e0>)
Gautam
  • Are your geocoordinates the locations of the houses, or are they already representative of the zipcode? If the latter: how would you like to aggregate the data, since there are possibly multiple observations per zipcode and we already have the cluster variable to be displayed? – DPH Feb 05 '22 at 03:43
  • @DPH they're the locations of the properties. In the entire dataset, 9 observations have more than 1 zipcode (repeat sales, bad data); all others have exactly one. zipcode is not one of the variables used for clustering. For aggregating, I want to use the clusters from the k-means algorithm instead of zipcode. I would, however, also want to do a choropleth with zipcodes (part of the report). – Gautam Feb 05 '22 at 16:56
  • @DPH I've edited the question to add a summary of the no. of observations by zipcode - hope that makes it a bit clearer. – Gautam Feb 05 '22 at 17:00

1 Answer

I used the Kaggle data on a simple laptop (i3, 8th gen) to build a ggplot2 object with randomly sampled cluster IDs and then converted it with the ggplotly() function. The resulting plotly object seems OK to work with for analysis, but I do not know your performance requirements:

library(dplyr)
library(ggplot2)
library(plotly)
library(rnaturalearth) # source of the basic map data (returnclass = "sf" below needs the sf package)

# read in the data from the zip, keep a minimal set of columns and sample a cluster_id
df <- readr::read_csv(unzip("path_to_zip/portland_housing.csv.zip")) %>%
    dplyr::select(az = `address/zipcode`, latitude, longitude) %>%
    dplyr::mutate(cluster_id = sample(1:7, n(), replace = TRUE))
# get the map data
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
# build the ggplot2 object (rings as shapes plus the alpha parameter reduce overplotting)
plt <- ggplot2::ggplot(data = world) +
    ggplot2::geom_sf() +
    ggplot2::geom_point(data = df, aes(x = longitude, y = latitude, color = factor(cluster_id)), size = 1, shape = 21, alpha = .7) + 
    ggplot2::coord_sf(xlim = c(-124.5, -122), ylim = c(45, 46), expand = FALSE)
# plot it:
plt


# plotly auto transform from ggplot2 object
plotly::ggplotly(plt)
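
Since the question also asks about polygons: one way to show cluster extents instead of individual points is a convex hull per cluster. The following is only a minimal sketch in base R; the sample coordinates and the helper name `cluster_hull` are made up for illustration. Note that hulls are only informative if the clusters are spatially separated, which is not the case for the randomly sampled cluster IDs used above.

```r
# Hypothetical sample shaped like df above (coordinates are invented)
pts <- data.frame(
  longitude  = c(-122.70, -122.60, -122.60, -122.70, -122.65,
                 -122.90, -122.80, -122.85, -122.85),
  latitude   = c(45.40, 45.40, 45.50, 45.50, 45.45,
                 45.50, 45.50, 45.60, 45.53),
  cluster_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2)
)

# Hypothetical helper: keep only the rows that lie on the convex hull
cluster_hull <- function(d) {
  idx <- grDevices::chull(d$longitude, d$latitude)
  d[idx, , drop = FALSE]
}

# One hull per cluster, stacked into a single data.frame
hulls <- do.call(rbind, lapply(split(pts, pts$cluster_id), cluster_hull))

# The hulls could then be layered onto the ggplot2 object, e.g.:
# plt + ggplot2::geom_polygon(
#   data = hulls,
#   ggplot2::aes(x = longitude, y = latitude,
#                group = cluster_id, fill = factor(cluster_id)),
#   alpha = .2
# )
```

Interior points are dropped, so the polygon layer stays small even with ~25,000 observations.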


EDIT

To include a map you can, for example, use the ggmap package instead of the map data from rnaturalearth. I will only display the plotly result:

library(ggmap)

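# https://stackoverflow.com/questions/23130604/plot-coordinates-on-map
# note: get_map() defaults to Google tiles, which need an API key registered via ggmap::register_google()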
sbbox <- ggmap::make_bbox(lon = c(-124.5, -122), lat = c(45, 46), f = .1)
myarea <- ggmap::get_map(location=sbbox, zoom=10, maptype="terrain")
myarea <- ggmap::ggmap(myarea)

plt2 <- myarea +
    ggplot2::geom_point(data = df, mapping = aes(x = longitude, y = latitude, color = factor(cluster_id)), shape = 21, alpha = .7) 

plotly::ggplotly(plt2)
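
Separately from the point map, the zipcode choropleth mentioned in the comments needs one value per zipcode before anything can be drawn. A minimal sketch of that aggregation in base R, assuming the dominant (most frequent) cluster per zipcode is the value of interest; the sample rows are hypothetical and `az` is the zipcode column as renamed above:

```r
# Hypothetical sample using the column names from df above (az = zipcode)
obs <- data.frame(
  az         = c(97224, 97224, 97224, 97211, 97211, 97215),
  cluster_id = c(7, 7, 4, 2, 2, 7)
)

# Most frequent cluster per zipcode (ties go to the smallest cluster_id,
# because table() sorts its names and which.max() takes the first maximum)
dominant <- aggregate(
  cluster_id ~ az,
  data = obs,
  FUN  = function(x) as.numeric(names(which.max(table(x))))
)
```

The resulting `dominant` table would still have to be joined to zipcode boundary polygons, which are not part of the Kaggle data and have to come from another source.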


There are many other approaches to getting the map data, such as using the Mapbox API.
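
As a rough sketch of the Mapbox route, plotly can draw the points directly on a tile map with a `scattermapbox` trace. The sample rows, zoom level and map centre below are assumptions for illustration; the "open-street-map" style is used because it does not need a Mapbox access token. I have not checked how it performs with all ~25,000 points.

```r
library(plotly)

# Hypothetical sample with the same columns as df above
pts <- data.frame(
  latitude   = c(45.405, 45.560, 45.498, 45.394, 45.475, 45.344),
  longitude  = c(-122.760, -122.650, -122.729, -122.591, -122.890, -122.595),
  cluster_id = c(7, 7, 2, 7, 4, 4)
)

# One marker per observation, coloured by cluster
fig <- plotly::plot_ly(
  data   = pts,
  type   = "scattermapbox",
  mode   = "markers",
  lat    = ~latitude,
  lon    = ~longitude,
  color  = ~factor(cluster_id),
  marker = list(size = 6, opacity = 0.7)
)

# Tile background; zoom and centre are guesses for the Portland area
fig <- plotly::layout(
  fig,
  mapbox = list(
    style  = "open-street-map",
    zoom   = 9,
    center = list(lat = 45.5, lon = -122.7)
  )
)
# printing `fig` renders the interactive map
```

Swapping `pts` for the full `df` gives the same kind of map for the real data.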

DPH
  • I'm able to follow the `ggplot` & `plotly` route but I need to plot this on a map for the report. – Gautam Feb 06 '22 at 12:10