I'm analysing real-estate sales for some North American cities and am using k-means clustering on the data. I have seven clusters, and for each observation in a cluster I have the latitude, longitude, zipcode, and cluster_id. I'd like to plot this on a map to better visualize the clusters. I'm not sure what such a plot is called: a choropleth? A polygon map?
Most of the examples use GeoJSON files, but I only have a data.frame object from my k-means clustering.
Actual data:
https://www.kaggle.com/threnjen/portland-housing-prices-sales-jul-2020-jul-2021
Sample data:
> dput(dt[runif(n = 10,min = 1,max = 25000)])
structure(list(id = c(23126L, 15434L, 5035L, 19573L, NA, 24486L,
NA, 14507L, 3533L, 20192L), zipcode = c(97224L, 97211L, 97221L,
97027L, NA, 97078L, NA, 97215L, 97124L, 97045L), latitude = c(45.40525436,
45.55965805, 45.4983139, 45.39398956, NA, 45.47454071, NA, 45.50736618,
45.52812958, 45.34381485), longitude = c(-122.7599182, -122.6500015,
-122.7288742, -122.591217, NA, -122.8898392, NA, -122.6084061,
-122.91745, -122.5948334), lastSoldPrice = c(469900L, 599000L,
2280000L, 555000L, NA, 370000L, NA, 605000L, 474900L, 300000L
), lotSize = c(5227L, 4791L, 64904L, 9147L, NA, 2178L, NA, 4356L,
2613L, 6969L), livingArea = c(1832L, 2935L, 5785L, 2812L, NA,
1667L, NA, 2862L, 1844L, 742L), cluster_id = c(7, 7, 2, 7, NA,
4, NA, 7, 7, 4)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7faa8000fee0>)
I tried following the example at https://gist.github.com/josecarlosgonz/8565908 to create a GeoJSON file from this data so that I could plot it, but without success.
I'm not using markers because I have ~25,000 observations: it would be difficult to plot them all, and the file would take forever to load.
EDIT:
observations by zipcode:
> dput(dat[, .N, by = .(`address/zipcode`)][(order(`address/zipcode`))])
structure(list(`address/zipcode` = c(7123L, 97003L, 97004L, 97005L,
97006L, 97007L, 97008L, 97009L, 97015L, 97019L, 97023L, 97024L,
97027L, 97030L, 97034L, 97035L, 97038L, 97045L, 97056L, 97060L,
97062L, 97068L, 97070L, 97078L, 97080L, 97086L, 97089L, 97113L,
97123L, 97124L, 97132L, 97140L, 97201L, 97202L, 97203L, 97204L,
97205L, 97206L, 97209L, 97210L, 97211L, 97212L, 97213L, 97214L,
97215L, 97216L, 97217L, 97218L, 97219L, 97220L, 97221L, 97222L,
97223L, 97224L, 97225L, 97227L, 97229L, 97230L, 97231L, 97232L,
97233L, 97236L, 97239L, 97266L, 97267L), N = c(1L, 352L, 9L,
252L, 421L, 1077L, 357L, 1L, 31L, 2L, 4L, 159L, 239L, 525L, 640L,
548L, 1L, 1064L, 5L, 353L, 471L, 736L, 6L, 403L, 866L, 913L,
8L, 5L, 1113L, 776L, 3L, 543L, 219L, 684L, 463L, 1L, 57L, 809L,
189L, 216L, 688L, 510L, 504L, 330L, 318L, 177L, 734L, 195L, 832L,
305L, 276L, 589L, 688L, 716L, 286L, 83L, 1307L, 475L, 77L, 150L,
382L, 444L, 290L, 423L, 430L)), row.names = c(NA, -65L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7f904781a6e0>)