R ignores some data on plots

Question

I have the following series of commands:

my_data = read.csv(file='r-stats.out', sep='\t', skip=1)
histsub = subset(my_data, my_data[,10] != "Invalid")
hist(as.numeric(histsub[,10]))

r-stats.out is a file with 10 columns. Column 10 (the one I am trying to plot) contains numbers ranging from -2000 to 10000, or the word "Invalid", which I try to filter out first. For some reason, my histogram only covers the range 0 to 2500, IGNORING everything else. Why? What is happening? I did a

print(histsub)

and everything looks okay: those numbers are there in histsub, but not on the plot. Please help.

EDIT: Adding a few lines from the printed my_data and also from histsub.

my_data:

38    629345  1  633201  0   -41 Invalid    0   g    0     -37
39    633201  0  628727  0  4496     323    0   g    0    4629
40    628727  0  631371  1  7835     202    0   g    0 Invalid
41    631371  1  625871  1  7317     112    0   g    0    7379
42    625871  1  633427  1  1351     348    0   g    0    1321

histsub:

38    629345  1  633201  0  -41 Invalid    0   g    0   -37
39    633201  0  628727  0 4496     323    0   g    0  4629
41    631371  1  625871  1 7317     112    0   g    0  7379
42    625871  1  633427  1 1351     348    0   g    0  1321

It's much easier to help if you provide a [minimal, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Check `str(my_data)`, are variables that you expect to be numeric really numeric, or have they been converted to factors when reading the data due to strings like "Invalid" among the numbers? — Henrik, Mar 23 '14 at 21:44

score 3 · Accepted Answer · answered Mar 23 '14 at 21:44

Try `my_data[,10] = as.numeric(as.character(my_data[,10]))`. All the "Invalid" string entries will then be converted to NA and won't show up in histograms anyway.
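To see why the extra `as.character()` step matters, here is a minimal sketch, assuming column 10 was read in as a factor (the `read.csv` default before R 4.0). The vector `x` is a hypothetical stand-in built from a few values in the question, not the real data.

```r
# Hypothetical stand-in for column 10: numbers mixed with "Invalid",
# stored as a factor.
x <- factor(c("-37", "4629", "Invalid", "7379", "1321"))

# as.numeric() on a factor returns the internal level codes (1, 2, 3, ...),
# not the numbers printed on screen.
codes <- as.numeric(x)

# Going through character first recovers the real values;
# "Invalid" becomes NA (the coercion warning is suppressed here).
values <- suppressWarnings(as.numeric(as.character(x)))

print(codes)
print(values)
```

With roughly 2500 distinct values in the real column, the level codes would run from 1 to about 2500, which would match the range seen on the histogram.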

Spacedman

score 2 · Answer 2 · answered Mar 23 '14 at 21:42

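That implies its class is character, so it's probably being implicitly converted to a factor, and there are ~2500 unique values. `as.numeric` on a factor returns the level codes (1 to ~2500) rather than the numbers themselves. Try using the argument `stringsAsFactors = FALSE` in `read.csv`.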
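A rough sketch of this suggestion, assuming a tab-separated input like the one in the question. The two-column `sample_text` and the column names `id` and `value` are made up for illustration; the real file has 10 columns and a line to skip.

```r
# Hypothetical tab-separated sample standing in for r-stats.out.
sample_text <- "id\tvalue\n38\t-37\n39\t4629\n40\tInvalid\n41\t7379\n"

# stringsAsFactors = FALSE keeps the mixed column as plain character.
d <- read.csv(text = sample_text, sep = "\t", stringsAsFactors = FALSE)

# Filtering out "Invalid" and converting now gives the real numbers,
# because as.numeric() is applied to strings rather than to a factor.
kept <- subset(d, value != "Invalid")
nums <- as.numeric(kept$value)

print(nums)
```

With the column kept as character, the original `subset` followed by `as.numeric` should plot the actual values.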

Robert Krzyzanowski

The "right" way would have been to use the 'colClasses' argument to `read.csv`. – IRTFM Mar 23 '14 at 21:57
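One way the `colClasses` suggestion might look, as a sketch only. The sample data and column names are hypothetical, and pairing `colClasses` with `na.strings` is an assumption added here: without it, reading "Invalid" into a numeric column makes `read.csv` stop with an error.

```r
# Hypothetical tab-separated sample standing in for r-stats.out.
sample_text <- "id\tvalue\n38\t-37\n39\t4629\n40\tInvalid\n41\t7379\n"

# Declare the column types up front and treat "Invalid" as missing,
# so the value column is numeric as soon as it is read.
d <- read.csv(text = sample_text, sep = "\t",
              colClasses = c("integer", "numeric"),
              na.strings = c("NA", "Invalid"))

str(d)
```

After this, `hist(d$value)` can be called directly, since `hist` drops NA values, and no `subset` or `as.numeric` step is needed.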
