
I'm working with prostate gene expression data (http://icos.cs.nott.ac.uk/datasets/microarray.html) in R and want to convert every entry to numeric so that I can write a similarity function. How do I convert all of the entries from factors to numeric values, keeping just the expression value? If I index into the data frame like this,

> prostate[5,4]
[1] 3.17469778457247
2093 Levels: 0.133822364738809 ... normal

I just want the value, 3.17...

Jilber Urbina
Dan Schwartz
  • Not a duplicate as such: the daft file format still leaves him a last row of sample names that he probably didn't know about, and it would mess up each simple conversion he tried. – Stephen Henderson Dec 02 '13 at 22:28

1 Answer


That file has character data on its last line. Because that line is not numeric, R could not read any column as numeric when it loaded the file, so it turned everything into factors. In bash you can see this:

tail -2 prostate_preprocessed.txt
AFFX-YEL021w/URA3_at 3.31255956783592 4.05800228545385 4.26348960812486 4.2180869800299 4.90599509636775 4.33488048792038 4.96535865133757 4.35350385526143 4.18529970123263 3.85103067777549 4.03836053811841 3.70345720098741 4.11379278781317 4.01121240340167 4.68296544299334 4.33584797205546 4.16864882878781 4.32781853396998 3.85145280458377 3.76586006943253 4.67388887037993 3.87182653639402 3.74997314075837 3.94258426954186 ...
tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor
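The effect can be reproduced in isolation. The following is a minimal sketch on a made-up three-line file (the values and the name `toy_path` are illustrative assumptions, not taken from the prostate data). It passes `stringsAsFactors = TRUE` explicitly, since that was the default when this was written but is no longer the default in R 4.0 and later.

```r
# Toy stand-in for the real file: two numeric rows, then a row of labels.
# The contents are invented for illustration only.
toy_path <- tempfile(fileext = ".txt")
writeLines(c("1.5 2.5",
             "3.5 4.5",
             "tumor normal"), toy_path)

# Reading every line: the label row forces each column to be non-numeric.
with_labels <- read.table(toy_path, stringsAsFactors = TRUE)
print(sapply(with_labels, class))

# Reading only the first two lines: the columns come back numeric.
numbers_only <- read.table(toy_path, nrows = 2)
print(sapply(numbers_only, class))
```

A single non-numeric entry anywhere in a column is enough to change the type of the whole column, which is why one trailing line affects every value.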

You can fix this by reading only up to the penultimate line. First count the lines (bash again):

wc -l prostate_preprocessed.txt
    2136 prostate_preprocessed.txt

Then, in R:

> prostate=read.table("prostate_preprocessed.txt", nrows=2135)
> prostate[4,5]
[1] 6.379761
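If you would rather not leave R to count the lines, the same idea can be sketched with `readLines`. The helper name `read_without_last_line` is hypothetical, and the toy file below is made up for illustration. The sketch assumes the file is small enough to read twice and that the label row is the very last line.

```r
# Hypothetical helper: read a whitespace-separated file, skipping its last line.
read_without_last_line <- function(path, ...) {
  n_lines <- length(readLines(path))        # total number of lines in the file
  read.table(path, nrows = n_lines - 1, ...)
}

# Toy file standing in for prostate_preprocessed.txt (invented values).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("1.5 2.5",
             "3.5 4.5",
             "tumor normal"), toy_path)

toy <- read_without_last_line(toy_path)
print(sapply(toy, class))   # check that the columns are numeric
print(toy[2, 1])
```

For the real file this would amount to `read_without_last_line("prostate_preprocessed.txt")`, which avoids hard-coding 2135.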

EDIT: P.S. It is a bizarre file format, as you probably want the tumour labels from the last row as column headers:

> cn=read.table("prostate_preprocessed.txt", skip=2135, colClasses="character")
> colnames(prostate)<-cn[1,]
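If the data frame has already been read with the label row included, it can also be converted in place rather than re-read. This is a rough sketch on an invented two-column data frame (`toy`, `values` and `labels` are illustrative names), assuming the last row holds the labels and every other entry is a number stored as a factor. The conversion has to go through `as.character`, because calling `as.numeric` directly on a factor returns the internal level codes instead of the printed values.

```r
# Invented example: numbers stored as factors, with a label in the last row.
toy <- data.frame(V1 = factor(c("3.17", "0.13", "tumor")),
                  V2 = factor(c("6.38", "4.05", "normal")))

# Keep the label row aside as plain character strings.
labels <- unname(sapply(toy[nrow(toy), ], as.character))

# Drop the label row, then convert each column factor -> character -> numeric.
values <- toy[-nrow(toy), ]
values[] <- lapply(values, function(col) as.numeric(as.character(col)))

# Optionally use the labels as column names; make.unique avoids duplicates.
colnames(values) <- make.unique(labels)

print(values)
print(sapply(values, class))
```

Any column that still contains non-numeric text after the last row is dropped would turn into `NA` with a warning, so it is worth checking the result with `sapply(values, class)` and `anyNA(values)`.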
Stephen Henderson