Column values getting corrupted when subsetting

Question

I am having a problem with subsetting. When I subset my data set, several columns in the resulting subset are filled with 0's and the variable class for these columns has changed to unknown. This happens consistently with certain subsets, but the affected columns vary from one affected subset to another.

I don't understand why this is happening. All I am doing is a simple subset command. Why is R losing 4 whole columns of numerical data and replacing them with nonsense?

The offending piece of code is this simple command here:

table.al = subset(bamboo_compounds,bamboo_compounds$CClass=="aldehyde")

The original data set looks like this:

Screenshot

The resulting subset looks like this:

Screenshot

Those four columns should be filled with numerical data.

I have literally done nothing other than load in a .csv file and then make a subset of that data. Please, can someone give me some idea of what might be causing this and how I can avoid it?

When asking for help, please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Pictures of data are not helpful. Also, no need to use `$` with subset: `subset(bamboo_compounds, CClass=="aldehyde")` — MrFlick, Dec 14 '16 at 19:48

score 1 · Answer 1 · answered Dec 15 '16 at 13:25

Are you sure your data have actually been corrupted? The only line we can see in the top view (of the entire data set) that's included in the subset is row 15: that reads

unknown aldehyde,aldehyde,yes,NA,0.00000,0.00000,0.00000,...

What appears in the lower view is

unknown aldehyde,aldehyde,yes,NA,0.00000,0.00000,0,0,...

that is, the only thing that I can see that's changed is the format of the last two columns (which is probably because all of the values for those columns in the subset are exactly zero, so there's no need to print all the decimal places).
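To illustrate the formatting point, here is a minimal sketch with made-up data (the `toy` data frame and its `conc` column are assumptions for illustration, not the asker's data). It uses console printing rather than the RStudio viewer, which appears to behave similarly: a numeric column is displayed with as many decimal places as its values need, so a subset containing only zeros is displayed as plain `0` even though the stored values and the class are unchanged.

```r
# Hypothetical toy data; names and values are invented for illustration.
toy <- data.frame(
  CClass = c("aldehyde", "ketone", "aldehyde"),
  conc   = c(0, 1.25, 0)
)

# Subset without `$`, as suggested in the comment on the question
sub <- subset(toy, CClass == "aldehyde")

print(toy)  # conc is displayed with decimals, because 1.25 needs them
print(sub)  # conc is displayed as plain 0: every remaining value is exactly zero

# The display changed, but the data did not
stopifnot(is.numeric(sub$conc))
stopifnot(identical(sub$conc, c(0, 0)))
```

Calling `format(toy$conc)` and `format(sub$conc)` directly shows the same effect on the character representation alone.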

As for the "unknown column type" thing, I think that's just an oddity of RStudio. When I enter this data set by hand

d <- read.csv(text=
 '"unknown aldehyde","aldehyde","yes",NA,0.0000,0.0000,0,0',
 header=FALSE)

and view it in RStudio I see those "unknown" labels on the last four columns. However, when I ask R what class those columns have, they're numeric (or integer).

sapply(d,class)
       V1        V2        V3        V4        V5        V6        V7 
 "factor"  "factor"  "factor" "logical" "numeric" "numeric" "integer" 
       V8 
"integer"

I haven't been able to find anything about this "column x: unknown" tag in the RStudio viewer (which is admittedly confusing); might be worth asking about this on the RStudio forums?
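If you want to confirm this on your own data rather than trusting the viewer, one option is to compare the column classes and values before and after subsetting. The sketch below uses an invented stand-in for the CSV (the column names `Compound`, `Identified`, `RT`, `A` and `B` and all values are assumptions); with the real data you would apply the same checks to `bamboo_compounds` and `table.al`. Note that on R 4.0 or later the text columns report "character" rather than "factor", because the default for `stringsAsFactors` changed.

```r
# Hypothetical stand-in for the asker's CSV; column names and values are invented.
csv_text <- 'Compound,CClass,Identified,RT,A,B
"unknown aldehyde","aldehyde","yes",NA,0.00000,0.00000
"hexanal","aldehyde","yes",5.2,0.00000,0.00000
"acetone","ketone","yes",2.1,0.31250,1.50000'

full <- read.csv(text = csv_text)
sub  <- subset(full, CClass == "aldehyde")

# Column classes before and after subsetting, side by side
classes_full <- sapply(full, class)
classes_sub  <- sapply(sub, class)
print(rbind(full = classes_full, subset = classes_sub))

# The same rows pulled out with plain indexing, for comparison
same_rows <- full[full$CClass == "aldehyde", ]

# Both checks pass if subsetting left classes and values untouched
stopifnot(identical(classes_full, classes_sub))
stopifnot(isTRUE(all.equal(sub, same_rows)))
```

If both `stopifnot()` calls pass silently, the subset holds exactly the original values and only the display differs.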
