10

I am using R (via R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. When I try k-means clustering and plot the result on a graph, I get the following error: "'princomp' can only be used with more units than variables"

I then created a test file of 10 rows and 10 columns, which plots fine, but when I add an extra column I get the error again. Why is this? I need to be able to plot my clusters. When I view my data set after performing kmeans on it, I can see the extra results column showing which cluster each row belongs to.

Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample? Please help, I have been wrecking my head over this for a week now. Thanks guys.

Amro
CoolSteve
  • Can you please post a reproducible example? – aL3xa Apr 16 '11 at 13:59
  • @aL3xa a sample of my data is the test table below, 3 rows and 11 columns. Is there any way I can use kmeans and plot these using R? 'free ipod apple tech google cat great mouse hello dog ball, 0.174292915 0.232990001 0.174292915 0 0 0 0 0 0 0 0, 0 0.349485002 0.261439373 0 0 0 0 0 0 0 0, 0.174292915 0 0 0.232990001 0.232990001 0 0 0 0 0 0 ' – CoolSteve Apr 17 '11 at 12:36
  • Please edit your question and put the data there, not in the comments. – JohnP Apr 17 '11 at 13:00

3 Answers

25

The problem is that you have more variables than sample points, so the principal component analysis being run to produce the plot fails.

The help file for princomp explains this (read ?princomp):

 ‘princomp’ only handles so-called R-mode PCA, that is feature
 extraction of variables.  If a data matrix is supplied (possibly
 via a formula) it is required that there are at least as many
 units as variables.  For Q-mode PCA use ‘prcomp’.
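
As a minimal sketch of the workaround, assuming random data as a stand-in for the asker's 200 x 800 table (the matrix `x`, the seed, and the choice of 3 clusters are all illustrative assumptions, not taken from the question), `princomp()` fails on the wide matrix while `prcomp()` accepts it, and the first two component scores can be plotted and coloured by k-means cluster:

```r
# Hypothetical stand-in for the asker's data: 200 rows (units), 800 columns (variables).
set.seed(42)
x <- matrix(runif(200 * 800), nrow = 200, ncol = 800)

# princomp() refuses a matrix with more variables than units.
res <- try(princomp(x), silent = TRUE)
failed <- inherits(res, "try-error")

# Cluster the rows; 3 centers is an arbitrary choice for illustration.
km <- kmeans(x, centers = 3)

# prcomp() works via the singular value decomposition and accepts wide data.
pca <- prcomp(x)

# Plot the rows on the first two principal components, coloured by cluster.
plot(pca$x[, 1:2], col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

This reproduces the plot by hand rather than through R Commander's menu, so the exact appearance will differ from what R Commander would draw.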
Gavin Simpson
jberg
  • Thanks for the reply. That explains that issue. Is there any way around this? Do you know if I can plot my vectors using kmeans? My data looks like 'free ipod apple tech google cat great mouse hello dog ball, 0.174292915 0.232990001 0.174292915 0 0 0 0 0 0 0 0, 0 0.349485002 0.261439373 0 0 0 0 0 0 0 0'. Each comma starts a new record, i.e. there are 2 records here, and the words are my column headings. – CoolSteve Apr 17 '11 at 12:39
  • @CoolSteve you didn't read the help, did you? Read `?princomp` and follow its instruction to **use** `prcomp()` instead. – Gavin Simpson Apr 17 '11 at 12:53
  • Thanks Gavin, I see the difference now and I was able to use prcomp to plot my clusters. Thanks for all the answers guys. Much appreciated. – CoolSteve Apr 20 '11 at 08:29
4

Principal component analysis is underspecified if you have fewer samples than dimensions. Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.

Simply speaking, you can look at the problem like this: if you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
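
A rough way to see this in R, assuming random Gaussian data with 10 rows and 11 columns (the sizes echo the asker's test case; the data itself is made up): after centering, the rows span at most n - 1 directions, so `prcomp()` reports only n - 1 components with non-negligible variance, no matter how many columns there are.

```r
# Hypothetical data: fewer rows (n) than columns (p).
set.seed(1)
n <- 10
p <- 11
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

pca <- prcomp(x)

# Count components whose standard deviation is not numerically zero,
# relative to the largest one.
n_informative <- sum(pca$sdev > 1e-8 * pca$sdev[1])
n_informative
```

The count should come out as n - 1, which is the sense in which the components are determined by the handful of samples rather than by any structure in the variables.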

Has QUIT--Anony-Mousse
1

You can use prcomp instead of princomp.

Max