I wish to convert missing observations in a comma-delimited data set to NA when using read.csv. I thought this was a trivial task. However, when I read the data set below, NA values do not appear in the last column.
A similar question was asked here: Change the Blank Cells to "NA"
When I use the solution suggested in that other post, I get my expected result (shown below). However, I do not understand why my original code (also shown below) does not work, or why the solution from the other post is needed at all. In other words, why does:
is.na(my.data$my.method[1])
return FALSE, as shown below?
Here is the data set:
ID,my.date,Ref,Zone,Group,Fruit,Area,Rating,Quality,Age,Sex,my.method
1,14-Aug-2016,SSS,1,2,115,Idaho,4,4,Adult,Unknown,
1,20-Aug-2015,SSS,1,2,144,Ohio,4,3,Adult,Unknown,
2,14-Aug-2012,TTT,1,2,115,Hawaii,4,3,Adult,Male,BBB
3,6-Jun-2015,RRR,1,2,239,Florida,4,3,Adult,Male,BBB
4,26-Jul-2016,SSS,1,1,80,Hawaii,4,4,Adult,Male,AAA
4,1-Aug-2015,GGG,2,1,83,Ohio,4,4,Adult,Male,AAA
5,5-Apr-2015,SSS,2,1,171,Idaho,4,4,Adult,Female,AAA
Note that when I select the data set from this post, there appear to be blank spaces after the last character on the right, but after the text is copied and pasted into a text file on a Windows computer there are no visible spaces. I checked that my issue can be reproduced by copying and pasting the data set above.
Here is my R code:
setwd('C:/Users/mmiller/Documents/simple R programs/')
my.data <- read.csv('confusing_NA_sample_data_for_stackoverflow.csv',
                    header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
my.data
# ID my.date Ref Zone Group Fruit Area Rating Quality Age Sex my.method
#1 1 14-Aug-2016 SSS 1 2 115 Idaho 4 4 Adult Unknown
#2 1 20-Aug-2015 SSS 1 2 144 Ohio 4 3 Adult Unknown
#3 2 14-Aug-2012 TTT 1 2 115 Hawaii 4 3 Adult Male BBB
#4 3 6-Jun-2015 RRR 1 2 239 Florida 4 3 Adult Male BBB
#5 4 26-Jul-2016 SSS 1 1 80 Hawaii 4 4 Adult Male AAA
#6 4 1-Aug-2015 GGG 2 1 83 Ohio 4 4 Adult Male AAA
#7 5 5-Apr-2015 SSS 2 1 171 Idaho 4 4 Adult Female AAA
my.data$my.method[1]
#[1] ""
is.na(my.data$my.method[1])
#[1] FALSE
my.data$my.method[1] == ''
#[1] TRUE
Here is what I was expecting:
expected.result <- read.table(text = '
ID my.date Ref Zone Group Fruit Area Rating Quality Age Sex my.method
1 14-Aug-2016 SSS 1 2 115 Idaho 4 4 Adult Unknown NA
1 20-Aug-2015 SSS 1 2 144 Ohio 4 3 Adult Unknown NA
2 14-Aug-2012 TTT 1 2 115 Hawaii 4 3 Adult Male BBB
3 6-Jun-2015 RRR 1 2 239 Florida 4 3 Adult Male BBB
4 26-Jul-2016 SSS 1 1 80 Hawaii 4 4 Adult Male AAA
4 1-Aug-2015 GGG 2 1 83 Ohio 4 4 Adult Male AAA
5 5-Apr-2015 SSS 2 1 171 Idaho 4 4 Adult Female AAA
', header = TRUE)
expected.result
I can obtain the expected result using the following code:
my.data <- read.csv('confusing_NA_sample_data_for_stackoverflow.csv',
                    header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
my.data
# ID my.date Ref Zone Group Fruit Area Rating Quality Age Sex my.method
#1 1 14-Aug-2016 SSS 1 2 115 Idaho 4 4 Adult Unknown <NA>
#2 1 20-Aug-2015 SSS 1 2 144 Ohio 4 3 Adult Unknown <NA>
#3 2 14-Aug-2012 TTT 1 2 115 Hawaii 4 3 Adult Male BBB
#4 3 6-Jun-2015 RRR 1 2 239 Florida 4 3 Adult Male BBB
#5 4 26-Jul-2016 SSS 1 1 80 Hawaii 4 4 Adult Male AAA
#6 4 1-Aug-2015 GGG 2 1 83 Ohio 4 4 Adult Male AAA
#7 5 5-Apr-2015 SSS 2 1 171 Idaho 4 4 Adult Female AAA
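For anyone who wants to reproduce both behaviours without creating a file, here is a minimal sketch that feeds the header and the first three data rows to read.csv through its text argument. The names csv.text, default.read, and blank.as.na are made up for this illustration and do not appear in my real script. I am assuming that the pasted text ends each of the first two rows with a trailing comma and nothing after it, as in my file.

```r
# Hypothetical in-memory copy of the header and the first three data rows.
# The first two rows end with a comma, so their last field is empty.
csv.text <- 'ID,my.date,Ref,Zone,Group,Fruit,Area,Rating,Quality,Age,Sex,my.method
1,14-Aug-2016,SSS,1,2,115,Idaho,4,4,Adult,Unknown,
1,20-Aug-2015,SSS,1,2,144,Ohio,4,3,Adult,Unknown,
2,14-Aug-2012,TTT,1,2,115,Hawaii,4,3,Adult,Male,BBB'

# Same arguments as my original call: only the literal string "NA" is
# listed in na.strings.
default.read <- read.csv(text = csv.text, header = TRUE,
                         stringsAsFactors = FALSE, na.strings = "NA")

# Same arguments as the solution from the other post: the empty string
# is listed in na.strings as well.
blank.as.na <- read.csv(text = csv.text, header = TRUE,
                        stringsAsFactors = FALSE, na.strings = c("", "NA"))

is.na(default.read$my.method)
#[1] FALSE FALSE FALSE

is.na(blank.as.na$my.method)
#[1]  TRUE  TRUE FALSE
```

This shows the same difference as the file-based calls: the empty fields in the character column come back as "" in the first call and as NA in the second.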
However, as stated above, I do not understand why:
is.na(my.data$my.method[1])
#[1] FALSE
when I do not use the solution suggested in the other post. Thank you for any explanation.