Replicating a Clustering Procedure in R

Question

I'm trying to replicate a clustering procedure similar to the one described in the following paper using R. The clustering procedure is discussed in detail on pages 7 and 8. I have origin and destination coordinates for a series of shipments and I want to cluster shipments into geographic regions. However, I'm not entirely sure what form I need to structure my spatial data in before applying the k-meansprocedure in R.

My initial thought was that the input data for the paper would look something like this:

Olat    Olong    Dlat    Dlong   Dist.Vol

34.271  -86.217  34.838 -81.686  226.6021
30.889  -87.776  30.689 -88.049  400
33.524  -86.805  34.167 -84.789  674.07
33.524  -86.805 34.779  -82.311  1100.66
33.524  -86.805 36.159  -86.791  800
34.201  -86.166 40.019  -82.878  2350
31.158  -88.016 45.524  -122.675 6711.44
.         .       .      .       .       
.         .       .      .       .       
.         .       .      .       .       
31.158  -88.016 32.084  -81.1   1301.85

In that case would performing my k-means clustering in R be as simple as the following:

input <- cbind( data$Olat, data$Olong, data$Dlat, data$Dlong, data$Dist.Vol)
results <- kmeans( data, 20)   # 20 determined optimal in paper

I've been having a difficult time visualizing the results of this procedure. Most of the spatial k-means clustering examples I've been able to find have only contained one set of latitude and longitude coordinates.

I'm not sure if or how I should account for the origin destination relationship in my clustering procedure. I'd appreciate any help I can get. Thanks.

EDIT

I'm clear on how to calculate non-euclidean distances using Haversine functions. I'm having trouble understanding what exactly is meant by this passage:

"With k-means, each coordinate is first weighted proportionally to its frequency at both the origin and destination. Then according to a predetermined number, clusters are formed by minimizing the weighted distance between coordinates."

For each distinct origin and destination (lat, lon) combination could I count the frequency with which it appears as both destination and origin and then multiply that by the average shipment distance? I'm not sure how to perform the k-means algorithm in 2-dimensions while taking into account the relationship between origins and destinations.

lat    long        Dist*Vol

34.271  -86.217     226.6021
30.889  -87.776     400
.         .            .
.         .            .
.         .            .
31.158  -88.016     1301.85

Pay attention that coordinates are not points in an Euclidean plane. This means that the distance between two points is not Euclidean, it's Haversine. — Omri374, Feb 02 '16 at 14:52
I calculated my distance using a Haversine formula, thanks. My main source of confusion is how to account for the origin-destination relationship in my actual k-means clustering procedure. — Rick Arko, Feb 02 '16 at 14:58
A simple way to weight samples is to replicate them. So if you have a shipment with frequency 100, you can replicate it a 100 times in your dataset, and then run `k-means` as before. Having said that, I'm sure there are k-means packages that allow you to weight samples. — Omri374, Feb 02 '16 at 20:21

score 1 · Accepted Answer · edited May 23 '17 at 12:31

Looks like you cluster based on 5 different features - Olong, Olat, Dlong, Dlat and Dist.Vol.

If you want to create spatial clusters, you need to have only two features. If I understand correctly, you should rbind Olong with Dlong and Olat with Dlat.

 data <- data.frame(lat = c(data$Olat,data$Dlat), lon = c(data$Olong,data$Dlong)

Then you can apply k-means on this two dimensional space.

 results <- kmeans(data, 20)

Note that using Euclidean distance (default for k-means) is not the right choice of metric here. You should use Haversine or project your points to a Cartesian space.

Regarding visualization- Once you have k-means set up, you could plot the centroids + a Voronoi diagram. Looks like this is the case in the paper. See this question for more details:

Thank you. This gave me a lot to consider. I've tried to rephrase my question to add clarity. I'm still working to see if I can get what I need with this. — Rick Arko, Feb 02 '16 at 18:07

score 1 · Answer 2 · answered Feb 18 '16 at 15:55

UPDATE

I've somewhat simplified my problem by deciding to focus only on high-volume outbound shipment regions. Now my input data comes with a little over 5,000 distinct cities with accompanying average statistics CPM= Cost Per Mile and Volume= Number of Outbound Shipments.

Oid    Long      Lat    CPM    Volume
203   -85.251   31.579  1.661   97
 .      .         .       .     .

I want to cluster based on geographic distance as well as average outbound cost per mile. In order to take into account volume I replicated each Oid (origin city) in my input matrix based on the number of shipments Volume. To do this I used the following code:

Distinct.Origins <- read.csv(".... file path... ")

by_origin <- group_by(Distinct.Origins, Oid)
o.expand  <- by_origin[rep(seq(nrow(by_origin)), by_origin$Volume), 1:9]  # 9 columns in data.frame

input <- as.data.frame(cbind(o.expand$Olong, o.expand$Olat, o.expand$CPM))
colnames(input) <- c("long", "lat", "CPM")

# To get a sense of what this looks like Oid = 203 now has 97 rows

head(input)
     long    lat      CPM
 -85.251 31.579 1.661815
 -85.251 31.579 1.661815
 -85.251 31.579 1.661815
 -85.251 31.579 1.661815
 -85.251 31.579 1.661815
 -85.251 31.579 1.661815

Next I ran my actual kmeans clustering procedure.

set.seed(123)
km <- kmeans(input, 12)
cent <- as.data.frame(km$centers)

Then I created a Voronoi Plot to visualize my data as follows

# Voronoi Plot
V <- deldir(cent$long, cent$lat)
states <- map_data("state")
Statemap <- ggplot() + geom_polygon(data = states, aes(x = long, y = lat, group = group), fill = "light green", col="black")

clust.state <- Statemap + geom_point(data = input, aes(x = long, y = lat), col = factor(km$cluster))
clust.state <- clust.state + geom_label(data = cent, aes(x = long, y = lat), label = row.names(cent))     

clust.state <- clust.state + geom_segment(data = V$dirsgs, aes(x=x1, y = y1, xend = x2, yend = y2), size = 2)

Summary

This doesn't exactly provide a solution to my original question, but I've opted to post in the hope of

Receiving additional feedback so I can make further improvements ( Particularly on my kmeans input matrix or on the Voronoi Plot)
In the hope that it will help others progress on their own geographical clustering problems

Replicating a Clustering Procedure in R

2 Answers2

UPDATE

Summary