I am trying to measure the relative similarity (or difference) between items, such as basketball players, and I am using a KNN classifier for this task. For instance, based on the data, I want to see how similar LeBron James is to Carmelo Anthony, how similar LeBron James is to Ray Allen, and how similar Carmelo Anthony is to Ray Allen. In other words, I want to compare every player with every other player.
I am running the code below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
with open('C:\\path_here\\nba.csv', 'r') as csvfile:
    dataset = pd.read_csv(csvfile)
print(dataset.columns.values)
# pd.read_csv already returns a DataFrame, so no conversion is needed
dataset.dtypes
# fill NAs with zeros
dataset = dataset.fillna(0)
dataset.isnull().sum()
dataset.isnull().sum().sum()
dataset.head()
X = dataset.iloc[:,4:27]
y = dataset.iloc[:,28]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
The data comes from here:
https://www.dropbox.com/s/b3nv38jjo5dxcl6/nba_2013.csv?dl=0
Basically, the code runs fine, but the output looks weird, or maybe I forgot to include something. Anyway, I'm trying to get results something like this:
LebronJames vs. CarmeloAnthony: .95
CarmeloAnthony vs. RayAllen: .92
RayAllen vs. LebronJames: .91
https://pythonhosted.org/scikit-fuzzy/auto_examples/plot_cmeans.html
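To make the target output concrete, here is a minimal sketch of the kind of pairwise comparison I have in mind. It is not KNN. It assumes cosine similarity on z-scored stats, which is only one possible choice of measure. The numbers are made up for illustration rather than read from the CSV, and the helper names (`standardize`, `cosine_similarity`, `pairwise_similarity`) are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

# Made-up per-game numbers (points, rebounds, assists, 3-pointers made),
# for illustration only; they are NOT taken from nba_2013.csv.
stats = {
    "LebronJames":    [27.0, 7.0, 6.5, 1.5],
    "CarmeloAnthony": [27.5, 8.0, 3.0, 2.2],
    "RayAllen":       [9.5, 2.8, 2.0, 1.6],
}

def standardize(rows):
    """Z-score each column so that no single stat dominates the comparison."""
    cols = list(zip(*rows.values()))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 for constant columns
    return {
        name: [(v - m) / s for v, m, s in zip(vals, mus, sds)]
        for name, vals in rows.items()
    }

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)

def pairwise_similarity(rows):
    """Score every unordered pair of players exactly once."""
    z = standardize(rows)
    return {
        (a, b): cosine_similarity(z[a], z[b])
        for a, b in combinations(z, 2)
    }

scores = pairwise_similarity(stats)
for (a, b), s in scores.items():
    print(f"{a} vs. {b}: {s:.2f}")
```

Note that cosine similarity on standardized data ranges from -1 to 1, so the scores will not necessarily look like the .95/.92/.91 values above. A distance-based score would be another option.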