
I have a dataframe that can be reconstructed from the dict below.

The dataframe represents 23 statistics (X1-X23) for various cities around the world. Each city occupies a single row in the dataframe with the 23 statistics as separate columns.

My actual df has ~6 million cities, so it's a large dataframe.

What I want to do is:

Step#1: Identify clusters of cities based on the 23 statistics (X1-X23).

Step#2: Given the identified clusters in Step#1, I want to construct a portfolio of cities such that:

a) number of cities selected from any given cluster is limited (limit may be different for each cluster)

b) avoid certain clusters altogether

c) apply additional criteria to the portfolio selection such that the correlation of poor weather between cities in the portfolio is minimized and correlation of good weather between cities is maximized.

My problem set is such that the K for a K-means algo would be quite large, but I'm not sure what that value is.

I've been reading the following on clustering:

Cluster analysis in R: determine the optimal number of clusters

How do I determine k when using k-means clustering?

X-means: Extending K-means...

However, a lot of the literature is foreign to me and will take me months to understand. I'm not a data scientist and don't have the time to take a course on machine learning.

At this point I have the dataframe and am now twiddling my thumbs.

I'd be grateful if you could help me move forward by actually implementing Step#1 and Step#2 in pandas with an example dataset.

The dict below can be reconstructed into a dataframe with pd.DataFrame(x), where x is the dict:

Output of df.head().to_dict('records'):

[{'X1': 123.40000000000001,
  'X2': -67.900000000000006,
  'X3': 172.0,
  'X4': -2507.1999999999998,
  'X5': 80.0,
  'X6': 1692.0999999999999,
  'X7': 13.5,
  'X8': 136.30000000000001,
  'X9': -187.09999999999999,
  'X10': 50.0,
  'X11': -822.0,
  'X12': 13.0,
  'X13': 260.80000000000001,
  'X14': 14084.0,
  'X15': -944.89999999999998,
  'X16': 224.59999999999999,
  'X17': -23.100000000000001,
  'X18': -16.199999999999999,
  'X19': 1825.9000000000001,
  'X20': 710.70000000000005,
  'X21': -16.199999999999999,
  'X22': 1825.9000000000001,
  'X23': 66.0,
  'city': 'SFO'},
 {'X1': -359.69999999999999,
  'X2': -84.299999999999997,
  'X3': 86.0,
  'X4': -1894.4000000000001,
  'X5': 166.0,
  'X6': 882.39999999999998,
  'X7': -19.0,
  'X8': -133.30000000000001,
  'X9': -84.799999999999997,
  'X10': 27.0,
  'X11': -587.29999999999995,
  'X12': 36.0,
  'X13': 332.89999999999998,
  'X14': 825.20000000000005,
  'X15': -3182.5,
  'X16': -210.80000000000001,
  'X17': 87.400000000000006,
  'X18': -443.69999999999999,
  'X19': -3182.5,
  'X20': 51.899999999999999,
  'X21': -443.69999999999999,
  'X22': -722.89999999999998,
  'X23': -3182.5,
  'city': 'YYZ'},
 {'X1': -24.800000000000001,
  'X2': -34.299999999999997,
  'X3': 166.0,
  'X4': -2352.6999999999998,
  'X5': 87.0,
  'X6': 1941.3,
  'X7': 56.600000000000001,
  'X8': 120.2,
  'X9': -65.400000000000006,
  'X10': 44.0,
  'X11': -610.89999999999998,
  'X12': 19.0,
  'X13': 414.80000000000001,
  'X14': 4891.1999999999998,
  'X15': -2396.0999999999999,
  'X16': 181.59999999999999,
  'X17': 177.0,
  'X18': -92.900000000000006,
  'X19': -2396.0999999999999,
  'X20': 805.60000000000002,
  'X21': -92.900000000000006,
  'X22': -379.69999999999999,
  'X23': -2396.0999999999999,
  'city': 'DFW'},
 {'X1': -21.300000000000001,
  'X2': -47.399999999999999,
  'X3': 166.0,
  'X4': -2405.5999999999999,
  'X5': 85.0,
  'X6': 1836.8,
  'X7': 55.700000000000003,
  'X8': 130.80000000000001,
  'X9': -131.09999999999999,
  'X10': 47.0,
  'X11': -690.60000000000002,
  'X12': 16.0,
  'X13': 297.30000000000001,
  'X14': 5163.3999999999996,
  'X15': -2446.4000000000001,
  'X16': 182.30000000000001,
  'X17': 83.599999999999994,
  'X18': -36.0,
  'X19': -2446.4000000000001,
  'X20': 771.29999999999995,
  'X21': -36.0,
  'X22': -378.30000000000001,
  'X23': -2446.4000000000001,
  'city': 'PDX'},
 {'X1': -22.399999999999999,
  'X2': -9.0,
  'X3': 167.0,
  'X4': -2405.5999999999999,
  'X5': 86.0,
  'X6': 2297.9000000000001,
  'X7': 41.0,
  'X8': 109.7,
  'X9': 64.900000000000006,
  'X10': 42.0,
  'X11': -558.29999999999995,
  'X12': 21.0,
  'X13': 753.10000000000002,
  'X14': 5979.6999999999998,
  'X15': -2370.1999999999998,
  'X16': 187.40000000000001,
  'X17': 373.10000000000002,
  'X18': -224.30000000000001,
  'X19': -2370.1999999999998,
  'X20': 759.5,
  'X21': -224.30000000000001,
  'X22': -384.39999999999998,
  'X23': -2370.1999999999998,
  'city': 'EWR'}]
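
For example, with the sample above stored in a list called records (the variable name is just a placeholder):

```python
import pandas as pd

# records is the list of dicts shown above (output of df.head().to_dict('records'))
df = pd.DataFrame(records).set_index('city')
df.shape  # (5, 23)
```
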
codingknob
  • You should post this question on [CrossValidated](http://stats.stackexchange.com), since this is not really a question about programming but about how to do clustering in general. Of course, do a search beforehand --- I'm pretty sure this has been asked before. – Alicia Garcia-Raboso Aug 24 '16 at 21:00
  • Funnily enough, a search on CrossValidated yields some posts that end up referring to StackOverflow --- [this answer](http://stackoverflow.com/questions/1793532/how-do-i-determine-k-when-using-k-means-clustering) in particular. – Alicia Garcia-Raboso Aug 24 '16 at 21:05
  • You can ask on [Data Science SE](http://datascience.stackexchange.com/) too. Although there are some approaches to determining the optimal *k*, they all have their own criteria, and those criteria may not match your needs. In practice, people generally try different k values and judge the results themselves. – ayhan Aug 24 '16 at 21:07
  • @ayhan: I didn't even know about the Data Science SE... Thanks for the pointer! – Alicia Garcia-Raboso Aug 24 '16 at 21:18
  • @AlbertoGarcia-Raboso It is still in beta but has very nice discussions. – ayhan Aug 24 '16 at 21:19

1 Answer


I don't know what you mean by "for further processing", but here is a super simple explanation to get you started.

1) Get the data into a pandas dataframe with the variables (X1-X23) as column headers and each row representing a different city (so that df.head() shows X1-X23 across the top).

2) Standardize the variables so that no single variable dominates the distance metric.

3) Decide whether to apply PCA for dimensionality reduction before running k-means.

4) Run k-means; scikit-learn makes this part easy (its documentation and examples cover the details).

5) Try silhouette analysis for choosing the number of clusters to get a start (a sketch of steps 2-5 follows).
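
A minimal sketch of steps 2) to 5) with scikit-learn might look like the following. It assumes df holds the full set of cities (not just the 5-row sample above) with X1-X23 as columns; the number of PCA components, the candidate k values, and the final k=20 are placeholders to experiment with:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# keep only the statistic columns (the city identifier stays out of the distance calculation)
features = [c for c in df.columns if c.startswith('X')]

# 2) standardize so that no single variable dominates the distance metric
X = StandardScaler().fit_transform(df[features])

# 3) optionally reduce dimensionality before clustering (10 components is an arbitrary choice)
X_reduced = PCA(n_components=10).fit_transform(X)

# 4) + 5) fit k-means for several candidate k values and compare silhouette scores;
#    scoring on a random sample keeps the silhouette computation tractable
for k in range(5, 51, 5):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_reduced)
    score = silhouette_score(X_reduced, labels, sample_size=10000, random_state=0)
    print(k, round(score, 3))

# once a k is chosen (20 here is only a placeholder), attach the labels for the selection step
df['cluster'] = KMeans(n_clusters=20, random_state=0, n_init=10).fit_predict(X_reduced)
```

For millions of rows, MiniBatchKMeans is a drop-in alternative to KMeans that scales much better.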

Good references:

Hastie and Tibshirani's book

Hastie and Tibshirani's free course (taught in R)

Udacity, Coursera, and edX courses on machine learning

EDIT: I forgot to mention: don't use your whole dataset while you are testing out the process. Use a much smaller portion of the data (e.g. 100K cities) so that processing time stays manageable until you get everything right.
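
For example (assuming the full dataframe is called df):

```python
# prototype on a random subset; switch back to the full frame once the pipeline works
df_small = df.sample(n=100_000, random_state=0)
```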

ivan7707
  • What I mean by "further processing" is: based on the identified clusters, I want a selection algorithm that enforces a limit on the number of cities that can be selected from any cluster. – codingknob Aug 24 '16 at 21:53
  • @codingknob follow what I outlined above to get an initial clustering. Then ask specific questions. – ivan7707 Aug 24 '16 at 23:25
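
For the per-cluster limit mentioned in the comment above, a minimal sketch along these lines could be a starting point. It assumes a cluster column produced by the clustering step; the limits dict and the X1 ranking column are placeholders for the actual caps and ranking criterion, and the correlation criterion in c) would need a separate step on top of this:

```python
# hypothetical per-cluster caps; clusters not listed here are avoided entirely
limits = {0: 10, 1: 5, 3: 8}

portfolio = (
    df[df['cluster'].isin(limits.keys())]        # b) drop unwanted clusters
      .sort_values('X1', ascending=False)        # placeholder ranking criterion
      .groupby('cluster', group_keys=False)
      .apply(lambda g: g.head(limits[g.name]))   # a) cap selections per cluster
)
```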