I'm trying to replicate, in R, a clustering procedure similar to the one described in the following paper. The clustering procedure is discussed in detail on pages 7 and 8. I have origin and destination coordinates for a series of shipments, and I want to cluster the shipments into geographic regions. However, I'm not sure how I need to structure my spatial data before applying the k-means procedure in R.
My initial thought was that the input data for the paper would look something like this:
Olat Olong Dlat Dlong Dist.Vol
34.271 -86.217 34.838 -81.686 226.6021
30.889 -87.776 30.689 -88.049 400
33.524 -86.805 34.167 -84.789 674.07
33.524 -86.805 34.779 -82.311 1100.66
33.524 -86.805 36.159 -86.791 800
34.201 -86.166 40.019 -82.878 2350
31.158 -88.016 45.524 -122.675 6711.44
. . . . .
. . . . .
. . . . .
31.158 -88.016 32.084 -81.1 1301.85
In that case, would performing my k-means clustering in R be as simple as the following?
input <- cbind(data$Olat, data$Olong, data$Dlat, data$Dlong, data$Dist.Vol)
results <- kmeans(input, 20) # cluster the matrix built above; 20 determined optimal in paper
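To make this reproducible, here is a self-contained sketch of the same idea using only the eight sample rows shown above. Two things in it are my own assumptions rather than anything from the paper: I standardize the columns with `scale()` because `Dist.Vol` is on a much larger scale than the coordinates and would otherwise dominate the Euclidean distances, and I use `k = 3` only because eight rows cannot support 20 clusters.

```r
# Sample rows copied from the table above (the elided rows are left out).
shipments <- data.frame(
  Olat     = c(34.271, 30.889, 33.524, 33.524, 33.524, 34.201, 31.158, 31.158),
  Olong    = c(-86.217, -87.776, -86.805, -86.805, -86.805, -86.166, -88.016, -88.016),
  Dlat     = c(34.838, 30.689, 34.167, 34.779, 36.159, 40.019, 45.524, 32.084),
  Dlong    = c(-81.686, -88.049, -84.789, -82.311, -86.791, -82.878, -122.675, -81.1),
  Dist.Vol = c(226.6021, 400, 674.07, 1100.66, 800, 2350, 6711.44, 1301.85)
)

# Assumption (mine, not the paper's): standardize every column so that
# Dist.Vol does not dominate the Euclidean distances used by kmeans().
input <- scale(as.matrix(shipments))

set.seed(1)   # kmeans() starts from random centers
k <- 3        # 3 only because there are 8 sample rows; the paper uses 20
results <- kmeans(input, centers = k, nstart = 10)

# One cluster label per shipment, in the original row order.
shipments$cluster <- results$cluster
print(shipments)
```

This treats each shipment as a single point in five dimensions, which is part of why I find the result hard to picture on a map.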
I've been having a difficult time visualizing the results of this procedure. Most of the spatial k-means clustering examples I've been able to find have only contained one set of latitude and longitude coordinates.
I'm not sure whether or how I should account for the origin-destination relationship in my clustering procedure. I'd appreciate any help I can get. Thanks.
EDIT
I'm clear on how to calculate non-Euclidean distances using the haversine formula. What I'm having trouble understanding is exactly what is meant by this passage:
"With k-means, each coordinate is first weighted proportionally to its frequency at both the origin and destination. Then according to a predetermined number, clusters are formed by minimizing the weighted distance between coordinates."
For each distinct origin and destination (lat, long) combination, could I count the frequency with which it appears as either an origin or a destination and then multiply that by the average shipment distance? I'm not sure how to perform the k-means algorithm in two dimensions while taking into account the relationship between origins and destinations.
lat long Dist*Vol
34.271 -86.217 226.6021
30.889 -87.776 400
. . .
. . .
. . .
31.158 -88.016 1301.85
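My current reading of the passage, as a rough sketch and not necessarily what the paper does: stack origins and destinations into one set of (lat, long) points, count how often each distinct coordinate appears, and cluster in two dimensions with those counts as weights. Base R's `kmeans()` has no weights argument, so I repeat each coordinate `freq` times, which has the same effect for integer weights. The data frame, the `freq` column, and `k = 3` are my own placeholders, and the distances here are plain Euclidean distances in degrees, not haversine.

```r
# Origins and destinations from the sample rows above (elided rows left out).
shipments <- data.frame(
  Olat  = c(34.271, 30.889, 33.524, 33.524, 33.524, 34.201, 31.158, 31.158),
  Olong = c(-86.217, -87.776, -86.805, -86.805, -86.805, -86.166, -88.016, -88.016),
  Dlat  = c(34.838, 30.689, 34.167, 34.779, 36.159, 40.019, 45.524, 32.084),
  Dlong = c(-81.686, -88.049, -84.789, -82.311, -86.791, -82.878, -122.675, -81.1)
)

# Stack both roles into one list of (lat, long) points.
coords <- rbind(
  data.frame(lat = shipments$Olat, long = shipments$Olong),
  data.frame(lat = shipments$Dlat, long = shipments$Dlong)
)

# Weight = how often each distinct coordinate appears as origin or destination.
coords$freq <- 1
weighted <- aggregate(freq ~ lat + long, data = coords, FUN = sum)

# Emulate integer weights by repeating each distinct coordinate freq times.
idx <- rep(seq_len(nrow(weighted)), times = weighted$freq)
expanded <- weighted[idx, c("lat", "long")]

set.seed(1)
fit <- kmeans(expanded, centers = 3, nstart = 10)   # 3 is a placeholder for 20

# Region label for each distinct coordinate (first copy of each one).
weighted$region <- fit$cluster[match(seq_len(nrow(weighted)), idx)]

# Describe each shipment as an (origin region, destination region) pair.
key <- paste(weighted$lat, weighted$long)
shipments$Oregion <- weighted$region[match(paste(shipments$Olat, shipments$Olong), key)]
shipments$Dregion <- weighted$region[match(paste(shipments$Dlat, shipments$Dlong), key)]
print(shipments)
```

With weights equal to the raw counts this is the same as running `kmeans()` on the stacked points directly; the explicit `freq` column is only there so that a different integer weight could be swapped in. Is pairing the two region labels per shipment the right way to keep the origin-destination relationship?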