
I have data from students who took a test with two sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to group the students into 3 clusters based on their scores, so that I can place them in 3 different skill levels (1, 2 and 3). Code sample below:

import pandas as pd

# Initialize the data as a dict of lists.
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
  
# Create DataFrame
df = pd.DataFrame(data)
  
# print dataframe.
df

I thought about using K-means clustering, but the online tutorial I followed works with x, y coordinates. Should I use Scores_section1 as x and Scores_section2 as y, or vice versa, and why?

Many thanks in advance for your help!

Kathia
  • Is K-Means clustering the right tool for the job? K-Means is a technique designed to split data up into groups with as little intragroup variance as possible, which doesn't seem like the right goal. In particular, there's no guarantee that the groups will have any useful qualitative interpretation: you might get a cluster which is good at level 3 but bad at level 2, so where do you put them? – Nick ODell Feb 07 '23 at 21:11
  • Hi, that's an interesting question, but the question content doesn't really show your research. Could you please edit it to include what you've searched for (on the topic of clustering)? Have you already found these other questions, for example? https://stackoverflow.com/q/7869609/1256347 https://stackoverflow.com/q/11513484/1256347 https://stackoverflow.com/q/35094454/1256347 https://stats.stackexchange.com/q/13781/93778 – Saaru Lindestøkke Feb 07 '23 at 21:11
  • Why not do something like take the sum of scores, then put the students into groups based on rank? For example, the top third goes into one class, the middle third into another, and the bottom third into another. – Nick ODell Feb 07 '23 at 21:13
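
A minimal sketch of the rank-based split suggested in the last comment, assuming pandas' `qcut` applied to the ranked totals. The `level` column name and the 1/2/3 labels are assumptions for illustration, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Marc", "Fay", "Emile", "bastian", "Karine", "kathia", "John", "moni"],
    "Scores_section1": [12, 24, 14, 20, 8, 10, 5, 23],
    "Scores_section2": [20, 4, 1, 0, 18, 9, 12, 10],
})
df["Sum_all_scores"] = df["Scores_section1"] + df["Scores_section2"]

# Rank first so that tied totals cannot produce duplicate bin edges,
# then cut the ranks into three equal-sized groups (bottom, middle, top).
ranks = df["Sum_all_scores"].rank(method="first")
df["level"] = pd.qcut(ranks, q=3, labels=[1, 2, 3]).astype(int)

print(df.sort_values("Sum_all_scores"))
```

With 8 students the groups cannot be exactly equal in size, so `qcut` places the boundaries at the nearest rank quantiles. Note that this approach only looks at the total and ignores which section the points came from.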

1 Answer


Try it this way.

import pandas as pd

# Initialize the data as a dict of lists.
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
  
# Create DataFrame
df = pd.DataFrame(data)
  
# print dataframe.
df


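# Import the required module
from sklearn.cluster import KMeans

# Initialize the class object (fixed seed so the labels are reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit on the numeric columns only and predict a cluster label per student.
# Selecting into a separate variable keeps the Name column in df and
# avoids assigning to a slice of the original DataFrame later on.
features = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
label = kmeans.fit_predict(features)
label


df['kmeans'] = label
df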


# K-means may be the most widely known clustering algorithm. It assigns examples to
# clusters so as to minimize the variance within each cluster.
# The original k-means paper (MacQueen, 1967) describes it as follows:
# "The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets
# on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably
# efficient in the sense of within-class variance."

# plot X & Y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
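
On the x/y question: K-means works on distances between points, and swapping two columns does not change any distance, so the choice of which section is x and which is y should only affect how the plot looks. Here is a small sketch to check this on the sample data. The helper name `cluster` and the fixed `random_state` are assumptions added for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

scores = pd.DataFrame({
    "Scores_section1": [12, 24, 14, 20, 8, 10, 5, 23],
    "Scores_section2": [20, 4, 1, 0, 18, 9, 12, 10],
})


def cluster(features):
    # Fixed seed and explicit n_init so that both runs are reproducible.
    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    return model.fit_predict(features)


labels_xy = cluster(scores[["Scores_section1", "Scores_section2"]])
labels_yx = cluster(scores[["Scores_section2", "Scores_section1"]])

# An adjusted Rand index of 1.0 means the two partitions are identical,
# up to a renaming of the cluster labels.
print(adjusted_rand_score(labels_xy, labels_yx))
```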


Feel free to modify the code to suit your needs.
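
One thing to keep in mind: the labels 0, 1, 2 that K-means returns are arbitrary and carry no order. If you need skill levels 1 to 3, one possible approach (an assumption on my part, not something K-means provides) is to rank the clusters by their mean total score. The `total` and `level` column names below are made up for this sketch.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Name": ["Marc", "Fay", "Emile", "bastian", "Karine", "kathia", "John", "moni"],
    "Scores_section1": [12, 24, 14, 20, 8, 10, 5, 23],
    "Scores_section2": [20, 4, 1, 0, 18, 9, 12, 10],
})

features = df[["Scores_section1", "Scores_section2"]]
df["kmeans"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Rank the clusters by mean total score:
# lowest mean -> level 1, highest mean -> level 3.
df["total"] = features.sum(axis=1)
cluster_means = df.groupby("kmeans")["total"].mean()
level_of_cluster = cluster_means.rank(method="first").astype(int)
df["level"] = df["kmeans"].map(level_of_cluster)

print(df[["Name", "kmeans", "total", "level"]])
```

As noted in the comments on the question, a cluster can be strong in one section and weak in the other, so check the cluster means per section before trusting this ordering.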

ASH