I have a plot where x is test a and y is test b. Each student is tested twice, and each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want to extract the ids of the students in different parts of the plot. Is there a way to do this?
-
What do you mean by "I want to export the id on different parts in the plot"? Are you looking for a clustering algorithm to identify the students who improved and the ones who did not? – RockScience Mar 11 '15 at 04:05
-
I have their individual scores, and I want to somehow extract the groups on the plot. For example, there are two big groups on the plot and I want to know the ids of the students in those two groups. What do you mean by a clustering algorithm? I think that would be helpful too. I actually have four tests, and I am trying to group students with similar growth patterns. Can you give me an example of your algorithm? Thank you! @RockScience – William Liu Mar 11 '15 at 04:10
-
William, you should do some research on clustering; there are many ways to identify groups of ids in a data set. http://www.statmethods.net/advstats/cluster.html I think in your case a simple k-means clustering would work. – RockScience Mar 11 '15 at 04:19
-
I suggest moving this question to stats.stackexchange.com – RockScience Mar 11 '15 at 04:24
-
Example data and an example output would be really useful too. – tospig Mar 11 '15 at 04:34
3 Answers
2
If myData is your data set, you can identify each group using a k-means algorithm. (Make sure x and y are centered and scaled appropriately beforehand.)
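# Simulate two groups of 50 points each, centered near (0, 0) and (1, 1)
myData <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(myData) <- c("x", "y")
# Fit k-means with 2 clusters; the outer parentheses print the result
(cl <- kmeans(myData, 2))
# Color each point by its assigned cluster
plot(myData, col = cl$cluster)
# Mark the cluster centers with stars
points(cl$centers, col = 1:2, pch = 8, cex = 2)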
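
Since the question asks for the ids in each group, here is a minimal sketch of how the cluster labels could be mapped back to students. It assumes the scores sit in a data frame with an id column; the names scores, id and ids_by_cluster are hypothetical, and the data is simulated in the same way as above.

```r
# Sketch: recover the ids that fall in each k-means cluster.
# 'scores', 'id' and 'ids_by_cluster' are hypothetical names.
set.seed(42)  # k-means starts from random centers, so fix the seed

scores <- data.frame(
  id = sprintf("S%03d", 1:100),
  x  = c(rnorm(50, sd = 0.3), rnorm(50, mean = 1, sd = 0.3)),
  y  = c(rnorm(50, sd = 0.3), rnorm(50, mean = 1, sd = 0.3)),
  stringsAsFactors = FALSE
)

# Cluster on the score columns only; nstart = 25 tries several random starts
cl <- kmeans(scores[, c("x", "y")], centers = 2, nstart = 25)

# Attach the cluster label to each student, then split the ids by cluster
scores$cluster <- cl$cluster
ids_by_cluster <- split(scores$id, scores$cluster)

ids_by_cluster[["1"]]  # ids of the students assigned to cluster 1
```

The same approach should carry over to four tests by passing all four gain-score columns to kmeans. Note that the cluster numbers themselves are arbitrary and can swap between runs.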

RockScience
0
Adding to the answer from @RockScience:
It may be better to first determine the number of clusters from the data instead of fixing it at 2. That way you will probably recover the actual groups of students rather than forcing the whole group into just 2 clusters.
A link on how to find the number of clusters: find the number of clusters
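
One common heuristic (not necessarily the one the link describes) is the "elbow" method: fit k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. A rough sketch on the simulated data from the first answer, where ks and wss are hypothetical names:

```r
# Sketch of the elbow heuristic for choosing the number of clusters.
set.seed(42)
myData <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(myData) <- c("x", "y")

# Candidate numbers of clusters
ks <- 1:6

# Total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(myData, centers = k, nstart = 25)$tot.withinss
})

# Inspect how quickly the within-cluster variation falls as k grows
print(data.frame(k = ks, wss = wss))
```

Plotting the result with plot(ks, wss, type = "b") gives the usual elbow picture. The choice of k remains a judgment call, so it is worth checking the resulting groups against the scatter plot.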
0
Why not select by thresholds?
You are interested in students in a particular range.
So why not formalize the range, and select where 0
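
A minimal sketch of threshold-based selection follows. The condition in the answer is cut off, so the cut-offs below (x_range, y_range) are placeholders rather than the answerer's values, and the data frame and its column names are hypothetical.

```r
# Sketch: select student ids whose gain scores fall inside a chosen range.
# The data and the cut-offs are made up for illustration.
scores <- data.frame(
  id = c("S01", "S02", "S03", "S04", "S05"),
  x  = c(-0.2, 0.4, 1.1,  0.8, 0.1),
  y  = c( 0.1, 0.6, 0.9, -0.3, 0.7),
  stringsAsFactors = FALSE
)

# Placeholder cut-offs; read suitable values off your own plot
x_range <- c(0, 1)
y_range <- c(0, 1)

# TRUE for students strictly inside both ranges
in_range <- scores$x > x_range[1] & scores$x < x_range[2] &
            scores$y > y_range[1] & scores$y < y_range[2]

selected_ids <- scores$id[in_range]
selected_ids
```

Compared with clustering, this keeps the group definition explicit and reproducible, at the cost of having to pick the cut-offs by hand.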

Has QUIT--Anony-Mousse