Python Example for KNN or K-Means Clustering

Question

I am looking at some sample data such as this:

Data:

ID         Name                        ParValue  Coupon  Maturity  Issuer                      Moodys    S&P_Fitch  Grade       Risk
37833100   Apple_Inc.                  1049      95      2030      Apple_Inc.                  Aaa       AAA        Investment  Highest_Quality
02079K107  Alphabet_Inc.               1055      99      2030      Alphabet_Inc.               Aa        AA         Investment  High_Quality
11659109   Alaska_Air_Group            996       98      2030      Alaska_Air_Group            A         A          Investment  Strong
931142103  Walmart_Stores,_Inc.        1195      99      2030      Walmart_Stores,_Inc.        Baa       BBB        Investment  Medium_Grade
495734523  Corp._Takeover              1108      97      2021      Corp._Takeover              Ba,_B     BB,_B      Junk        Speculative
193467211  Toys_R_Us                   1109      105     2021      Toys_R_Us                   Caa/Ca/C  CCC/CC/C   Junk        Highly_Speculative
576300972  Enron                       1062      102     2021      Enron                       C         D          Junk        In_Default
983457823  Economic_Consultants_Inc.                               Economic_Consultants_Inc.   Baa       BBB        Investment  Medium_Grade
894652378  Forecast_Backtesters_Corp.                              Forecast_Backtesters_Corp.  Aaa       AAA        Investment  Highest_Quality

Image:

So, if Walmart has Baa, BBB, Investment, and Medium_Grade (for Moodys, S&P_Fitch, Grade, and Risk) and Economic_Consultants_Inc. has the same attributes, I can infer that Economic_Consultants_Inc. has 1195, 99, and 2030 (for ParValue, Coupon, and Maturity), even though those data points are missing.

This is probably a KNN problem, but I'm thinking K-Means could be useful too. Basically, I'm trying to figure out how to update missing data points (ParValue, Coupon, & Maturity), like the ones colored pink in the image above, based on similar attributes. Then, I want to group similar items together (K-Means problem). Has someone here come across a good online example of how to do this? I looked online today and found some examples using randomly generated numbers, but my data sets will NOT have randomly generated numbers. I would appreciate any insight into how to solve this problem.

Maybe this might help: https://pythonprogramming.net/k-means-titanic-dataset-machine-learning-tutorial/ You can also follow the same tutorial for designing K-Means from scratch. — Faizan Naseer, May 30 '19 at 05:53

score 2 · Answer 1 · answered May 30 '19 at 10:41


What you seem to be missing is pandas.

I suggest you go through the "10 minutes to pandas" tutorial to get started. The approach should be:

1. Load the data into a DataFrame using pandas.
2. Use the apply method to fill in the missing values, based on the conditions you stated above.
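As a rough illustration of those two steps, here is a minimal sketch using a few rows from the question's table. It assumes that rows sharing the same Moodys, S&P_Fitch, Grade, and Risk values should share ParValue, Coupon, and Maturity, and it uses groupby/transform rather than apply, which is one of several ways to do this. The names key_cols, fill_cols, and donor are illustrative only.

```python
import pandas as pd

# A subset of the question's table; None marks the missing values.
df = pd.DataFrame({
    "Name": ["Apple_Inc.", "Walmart_Stores,_Inc.",
             "Economic_Consultants_Inc.", "Forecast_Backtesters_Corp."],
    "ParValue": [1049, 1195, None, None],
    "Coupon": [95, 99, None, None],
    "Maturity": [2030, 2030, None, None],
    "Moodys": ["Aaa", "Baa", "Baa", "Aaa"],
    "S&P_Fitch": ["AAA", "BBB", "BBB", "AAA"],
    "Grade": ["Investment"] * 4,
    "Risk": ["Highest_Quality", "Medium_Grade",
             "Medium_Grade", "Highest_Quality"],
})

key_cols = ["Moodys", "S&P_Fitch", "Grade", "Risk"]   # attributes to match on
fill_cols = ["ParValue", "Coupon", "Maturity"]        # columns to fill in

# For every row, look up the first non-missing value found among the rows
# that have exactly the same rating attributes.
donor = df.groupby(key_cols)[fill_cols].transform("first")

# Only the missing cells are replaced; existing values stay untouched.
df[fill_cols] = df[fill_cols].fillna(donor)

print(df[["Name"] + fill_cols])
```

If a rating group contains no row with known values, its cells simply stay missing, so this only covers the exact-match case.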

This answer is similar to what you might have to do.


Ayesh Salahuddin


Thanks everyone. This makes perfect sense. One more thing: not all attributes will match up perfectly, so I need to do this using some kind of relative match, like a 95% match or an 89% match. I need to do some kind of cohort matching and get the relative match. Can someone here post an example of how to do that? I Googled and found a few things, but nothing that I'm really thrilled about. Thanks. – ASH May 30 '19 at 13:24
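One simple way to get a relative match is to score each candidate by the fraction of categorical attributes it shares with the target record. The sketch below uses only the standard library. The function match_fraction and the record Hypothetical_Corp. are made up for illustration and are not part of the original data. Treating every attribute as equally important is also an assumption.

```python
def match_fraction(a, b, attrs):
    """Fraction of the listed attributes on which records a and b agree."""
    return sum(a[k] == b[k] for k in attrs) / len(attrs)

attrs = ["Moodys", "S&P_Fitch", "Grade", "Risk"]

# Records with known values, taken from the question's table.
known = [
    {"Name": "Walmart_Stores,_Inc.", "Moodys": "Baa", "S&P_Fitch": "BBB",
     "Grade": "Investment", "Risk": "Medium_Grade"},
    {"Name": "Alaska_Air_Group", "Moodys": "A", "S&P_Fitch": "A",
     "Grade": "Investment", "Risk": "Strong"},
    {"Name": "Enron", "Moodys": "C", "S&P_Fitch": "D",
     "Grade": "Junk", "Risk": "In_Default"},
]

# Hypothetical record that only partially matches any known row.
target = {"Name": "Hypothetical_Corp.", "Moodys": "Baa", "S&P_Fitch": "A",
          "Grade": "Investment", "Risk": "Medium_Grade"}

# Score every known record against the target and pick the closest one.
scores = {r["Name"]: match_fraction(target, r, attrs) for r in known}
best = max(known, key=lambda r: match_fraction(target, r, attrs))

for name, score in scores.items():
    print(f"{name}: {score:.0%} match")
print("Closest cohort member:", best["Name"])
```

The missing values could then be copied from the best match, or averaged over all records above a chosen threshold.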

score 1 · Accepted Answer · answered May 30 '19 at 12:01


You can also do missing-value imputation with the impyute package.
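To show what KNN-based imputation looks like in practice, here is a minimal sketch that uses scikit-learn's KNNImputer rather than impyute, so it does not demonstrate impyute's own API. The numeric encoding rating_rank is a hypothetical mapping invented for this example, since KNN imputers need numeric input.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical ordinal encoding of the Moody's rating (lower = better).
rating_rank = {"Aaa": 0, "Aa": 1, "A": 2, "Baa": 3}

# Columns: rating rank, ParValue, Coupon, Maturity
X = np.array([
    [rating_rank["Aaa"], 1049, 95, 2030],              # Apple_Inc.
    [rating_rank["Aa"],  1055, 99, 2030],              # Alphabet_Inc.
    [rating_rank["A"],    996, 98, 2030],              # Alaska_Air_Group
    [rating_rank["Baa"], 1195, 99, 2030],              # Walmart_Stores,_Inc.
    [rating_rank["Baa"], np.nan, np.nan, np.nan],      # Economic_Consultants_Inc.
], dtype=float)

# With n_neighbors=1 each missing cell is copied from the single nearest row,
# where distance is computed only over the features both rows have.
imputer = KNNImputer(n_neighbors=1)
filled = imputer.fit_transform(X)

print(filled[-1])
```

With more than one known feature per row, the columns would normally be scaled first, because otherwise a large-valued column such as ParValue would dominate the distance.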


JAbr

