I have a data matrix (900 columns and 5000 rows) that I would like to run a PCA on.

The matrix looks fine in Excel (all the values are quantitative), but after I read the file into R and try to run the PCA code, I get an error saying "The following variables are not quantitative", followed by a list of the non-quantitative variables.

So some variables in the file are read as quantitative and some are not, seemingly at random. See the example below: variable 1 is read correctly as quantitative, while variable 2 is read incorrectly as non-quantitative.

> data$variable1[1:5]
[1] -0.7617504 -0.9740939 -0.5089303 -0.1032487 -0.1245882

> data$variable2[1:5]
[1] -0.183546332959017 -0.179283451229594 -0.191165669598284 -0.187060515423038
[5] -0.184409474669824
731 Levels: -0.001841783473108 -0.001855956210119 ... -1,97E+05

So my question is: how can I change all the non-quantitative variables into quantitative ones?

Making the file shorter does not help, as the values then become quantitative on their own. I do not know what is happening. Here is the link to my original file: https://docs.google.com/file/d/0BzP-YLnUNCdwakc4dnhYdEpudjQ/edit

I also tried the answers given below, but they still don't help.

So let me show exactly what I did:

> data <- read.delim("file.txt", header=T)
> res.pca = PCA(data, quali.sup=1, graph=T)
Error in PCA(data, quali.sup = 1, graph = T) :
The following variables are not quantitative:  batch
The following variables are not quantitative:  target79
The following variables are not quantitative:  target148
The following variables are not quantitative:  target151
The following variables are not quantitative:  target217
The following variables are not quantitative:  target266
The following variables are not quantitative:  target515
The following variables are not quantitative:  target530
The following variables are not quantitative:  target587
The following variables are not quantitative:  target620
The following variables are not quantitative:  target730
The following variables are not quantitative:  target739
The following variables are not quantitative:  target801
The following variables are not quantitative:  target803
The following variables are not quantitative:  target809
The following variables are not quantitative:  target819
The following variables are not quantitative:  target868
The following variables a
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Letin
  • I might be wrong, but I suspect that 97E+05 is doing the trick. Check for entries containing things like that which are not numbers. Are you exporting as CSV? – sebastian-c Feb 28 '13 at 09:58
  • @sebastian-c I now removed all the values with "E" in the file (like -1,97E+05) .. i still get the same error.. I have it exported as a "text tab delimited".. Another thing here is that, check the difference in values with variable1 and variable2. The quantitative variables are short and the non-quantitative are long. – Letin Feb 28 '13 at 10:08
  • How does your data get from Excel to R? That's a factor you have in variable2. – themel Feb 28 '13 at 10:09
  • Please link us to your CSV file or create a short example with which we can reproduce this problem. We can only speculate until then. – Arun Feb 28 '13 at 10:10
  • Yes thats true. Its a factor there. When I first opened the txt file in excel, I had all the variables as quantitative. But when i now try to read it in R and run PCA, i get factors in some variables. Do you know if i can convert them all into quantitative/numbers ? – Letin Feb 28 '13 at 10:11
  • @Arun, yes let me do that here. Give me a few moments. – Letin Feb 28 '13 at 10:12
  • I have shown an example of my file above. But both quantitative and non-quantitative look the same here.. – Letin Feb 28 '13 at 10:33
  • @PoojaMandaviya I cannot reproduce the error. Are you using `header=TRUE`? – sebastian-c Feb 28 '13 at 10:43
  • @PoojaMandaviya, I have no problems loading this data. It loads all columns as `numeric`. Can you run your code on **this** test data and edit your post once again with **your code and output**? – Arun Feb 28 '13 at 10:44
  • Sorry about my late reply. I have done some editing on my question to show what exactly I have done. I request you to see it again. – Letin Feb 28 '13 at 12:28
  • @PoojaMandaviya Your file is still littered with entries like 1,97E+05. I can't even read it in because I get an error at line 97 claiming it doesn't have 4827 elements. – sebastian-c Feb 28 '13 at 13:12
  • The problems in this question seem far too specific to your dataset. I'm voting to close as too localised. – sebastian-c Feb 28 '13 at 13:35
  • @sebastian-c : I had removed those E entries , but i did not realise they were still in the file i gave a link to. I apologise for it. I am still struggling on converting these values. But anyways I will try figuring it out. Thanks for your help.. – Letin Feb 28 '13 at 14:07

3 Answers

1

By default (in R versions before 4.0.0), read.csv() and read.table() convert string columns to factors. This can result in unexpected behavior. Turn off this default with:

      read.csv(x, stringsAsFactors=FALSE)

You can, alternatively, coerce factors to numeric with

      newVar<-as.numeric(oldVar)
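
To see which columns and entries are causing trouble, something along these lines may help. It is only a sketch: the inline text stands in for the real tab-delimited file, the column names are made up, and the entry "-1,97E+05" is modelled on the one visible in the factor levels in the question.

```r
# Hypothetical miniature of the tab-delimited file (names and values invented
# for illustration; only "-1,97E+05" is taken from the question).
txt <- "batch\ttarget1\ttarget2\nA\t-0.76\t-0.18\nB\t-0.97\t-1,97E+05\nA\t-0.51\t-0.19\n"

# Read without converting strings to factors
dat <- read.delim(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Names of the columns that R could not parse as numbers
non_numeric <- names(dat)[!sapply(dat, is.numeric)]
print(non_numeric)

# Entries of one such column that fail to convert (as.numeric gives NA for them)
bad <- is.na(suppressWarnings(as.numeric(dat$target2)))
print(dat$target2[bad])
```

A single unparseable entry is enough for the whole column to be read as text, so listing the entries that turn into NA usually points straight at the lines of the file that need fixing.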
charlie
  • Hey Charlie, thanks for your reply. But it says here that , file_new <- as.numeric(file) Error: (list) object cannot be coerced to type 'double' – Letin Feb 28 '13 at 12:22
  • You get that error since the object `file_new` is created with class dataframe, because some variables are numeric and some are character. (check with `class(file_new)`) – Oscar de León Feb 28 '13 at 12:55
  • Right you are. I should have been clearer. You can't coerce the entire dataframe. And, as Edwin correctly points out, you may not want to. In my experience, the default conversion to factors in read.table() can cause headaches. I've set my editor to enter "stringsAsFactors=FALSE" as default. – charlie Feb 28 '13 at 21:34
0

R treats your variables as factors, as mentioned by Arun, and stores them in a data.frame (which is in fact a list). There are numerous ways to solve this problem; one is to convert the data into a numeric matrix in the following way:

matrix <- as.numeric(as.matrix(data))
dim(matrix) <- dim(data)

Now you can run your PCA on the matrix.

Edit:

Extending the example a bit: the second part of charlie's suggestion won't work. Copy the following session and see how it behaves:

d <- data.frame(
 a = factor(runif(2000)),
 b = factor(runif(2000)),
 c = factor(runif(2000)))

try(as.numeric(d)) # fails: a data frame is a list and cannot be coerced as a whole

as.numeric(d$a) # runs, because d$a is a vector, but this is not what you are
# after: R returns the integer level codes instead of the actual values.

(m <- as.numeric(as.matrix(d))) # this does the right thing
dim(m)                        # but m has lost its dimensions and is now a plain vector

dim(m) <- dim(d)              # assign the dimensions of d to m

svd(m)                        # you can run the PCA function of your liking on m
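
If the data frame also holds a genuinely qualitative column (such as batch in the question), a column-by-column conversion may be more convenient than going through a matrix, because it keeps the data frame and its column names. A minimal sketch, assuming a made-up data frame with hypothetical columns batch, x and y:

```r
# Hypothetical data frame: one qualitative column and two factor columns
# whose labels are really numbers
d2 <- data.frame(
  batch = factor(c("A", "B", "A")),
  x = factor(c("0.5", "1.5", "2.5")),
  y = factor(c("-0.1", "-0.2", "-0.3"))
)

# Columns that are meant to be quantitative (everything except batch)
num_cols <- setdiff(names(d2), "batch")

# Convert each of them via its labels, not via its level codes
d2[num_cols] <- lapply(d2[num_cols], function(col) as.numeric(as.character(col)))

str(d2)  # batch is still a factor, x and y are now numeric
```

Labels that are not valid numbers become NA with a warning, so it is worth checking for NA values before running the PCA.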
Edwin
  • Thanks Edwin. Let me try this and get back. I was just spending time on reruning my analysis on the file and getting back with the specific errors. And also will give a link to my file. Let me get back in a few moments to say if it works. – Letin Feb 28 '13 at 11:13
0

Use as.numeric(as.character(data$variable2[1:5])): as.character first retrieves the string labels of the factor variable, and as.numeric then converts those strings to numbers.
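
A small illustration of the difference, as a sketch only. The factor below is made up, and replacing the comma with a point assumes that the comma in entries like "-1,97E+05" is meant as a decimal mark, which the question does not confirm.

```r
# Hypothetical factor with one decimal-comma entry like the one in the question
f <- factor(c("-0.18", "-0.19", "-1,97E+05"))

codes <- as.numeric(f)  # integer level codes, NOT the values themselves

# Assumption: the comma is a decimal mark, so swap it for a point first
labels <- sub(",", ".", as.character(f), fixed = TRUE)
vals <- as.numeric(labels)  # the actual numeric values

print(codes)
print(vals)
```

Without the sub() step, as.numeric would return NA for the comma entry and emit a warning.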

Qbik