
I want to convert missing observations in a comma-delimited data set to NA when using read.csv. I thought this was a trivial task. However, when I read the data set below, NAs do not appear in the last column.

A similar question was asked here: Change the Blank Cells to "NA"

When I use a solution suggested at that other post I get my expected result (shown below). However, I do not understand why my original code (also shown below) does not work. Why do I even need to use the solution suggested at the other post? In other words, why does:

is.na(my.data$my.method[1])

return FALSE, as shown below?

Here is the data set:

ID,my.date,Ref,Zone,Group,Fruit,Area,Rating,Quality,Age,Sex,my.method
1,14-Aug-2016,SSS,1,2,115,Idaho,4,4,Adult,Unknown,
1,20-Aug-2015,SSS,1,2,144,Ohio,4,3,Adult,Unknown,
2,14-Aug-2012,TTT,1,2,115,Hawaii,4,3,Adult,Male,BBB
3,6-Jun-2015,RRR,1,2,239,Florida,4,3,Adult,Male,BBB
4,26-Jul-2016,SSS,1,1,80,Hawaii,4,4,Adult,Male,AAA
4,1-Aug-2015,GGG,2,1,83,Ohio,4,4,Adult,Male,AAA
5,5-Apr-2015,SSS,2,1,171,Idaho,4,4,Adult,Female,AAA

Note that when I select the data set from this post there appear to be blank spaces after the last character to the right, but after the text is copied and pasted into a text file on a Windows computer there are no visible spaces. I checked that my issue can be reproduced by copying and pasting the data set above.

Here is my R code:

setwd('C:/Users/mmiller/Documents/simple R programs/')

my.data <- read.csv('confusing_NA_sample_data_for_stackoverflow.csv', 
           header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
my.data
#  ID     my.date Ref Zone Group Fruit    Area Rating Quality   Age     Sex my.method
#1  1 14-Aug-2016 SSS    1     2   115   Idaho      4       4 Adult Unknown          
#2  1 20-Aug-2015 SSS    1     2   144    Ohio      4       3 Adult Unknown          
#3  2 14-Aug-2012 TTT    1     2   115  Hawaii      4       3 Adult    Male       BBB
#4  3  6-Jun-2015 RRR    1     2   239 Florida      4       3 Adult    Male       BBB
#5  4 26-Jul-2016 SSS    1     1    80  Hawaii      4       4 Adult    Male       AAA
#6  4  1-Aug-2015 GGG    2     1    83    Ohio      4       4 Adult    Male       AAA
#7  5  5-Apr-2015 SSS    2     1   171   Idaho      4       4 Adult  Female       AAA

my.data$my.method[1]
#[1] ""

is.na(my.data$my.method[1])
#[1] FALSE

my.data$my.method[1] == ''
#[1] TRUE

Here is what I was expecting:

expected.result <- read.table(text = '
  ID     my.date Ref Zone Group Fruit    Area Rating Quality   Age     Sex my.method
   1 14-Aug-2016 SSS    1     2   115   Idaho      4       4 Adult Unknown        NA
   1 20-Aug-2015 SSS    1     2   144    Ohio      4       3 Adult Unknown        NA
   2 14-Aug-2012 TTT    1     2   115  Hawaii      4       3 Adult    Male       BBB
   3  6-Jun-2015 RRR    1     2   239 Florida      4       3 Adult    Male       BBB
   4 26-Jul-2016 SSS    1     1    80  Hawaii      4       4 Adult    Male       AAA
   4  1-Aug-2015 GGG    2     1    83    Ohio      4       4 Adult    Male       AAA
   5  5-Apr-2015 SSS    2     1   171   Idaho      4       4 Adult  Female       AAA
', header = TRUE)
expected.result

I can obtain the expected result using the following code:

my.data <- read.csv('confusing_NA_sample_data_for_stackoverflow.csv', 
           header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
my.data

#  ID     my.date Ref Zone Group Fruit    Area Rating Quality   Age     Sex my.method
#1  1 14-Aug-2016 SSS    1     2   115   Idaho      4       4 Adult Unknown      <NA>
#2  1 20-Aug-2015 SSS    1     2   144    Ohio      4       3 Adult Unknown      <NA>
#3  2 14-Aug-2012 TTT    1     2   115  Hawaii      4       3 Adult    Male       BBB
#4  3  6-Jun-2015 RRR    1     2   239 Florida      4       3 Adult    Male       BBB
#5  4 26-Jul-2016 SSS    1     1    80  Hawaii      4       4 Adult    Male       AAA
#6  4  1-Aug-2015 GGG    2     1    83    Ohio      4       4 Adult    Male       AAA
#7  5  5-Apr-2015 SSS    2     1   171   Idaho      4       4 Adult  Female       AAA
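
For anyone who wants to reproduce this without the file, here is a minimal sketch. It assumes a shortened copy of the data (three rows, three columns) passed through the `text` argument of `read.csv`; the names `csv.text`, `default.na` and `blank.na` are made up for illustration and are not part of my original code.

```r
# Shortened, inline copy of the data set (assumption: three columns are
# enough to show the behaviour of the trailing empty field).
csv.text <- 'ID,Sex,my.method
1,Unknown,
1,Unknown,
2,Male,BBB'

# Original call: only the literal string "NA" is treated as missing.
default.na <- read.csv(text = csv.text, header = TRUE,
                       stringsAsFactors = FALSE, na.strings = "NA")

# Call from the other post: empty fields are treated as missing as well.
blank.na <- read.csv(text = csv.text, header = TRUE,
                     stringsAsFactors = FALSE, na.strings = c("", "NA"))

is.na(default.na$my.method)
#[1] FALSE FALSE FALSE

is.na(blank.na$my.method)
#[1]  TRUE  TRUE FALSE
```

The only difference between the two calls is the `na.strings` argument, so the sketch isolates the behaviour I am asking about.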

However, as stated above, I do not understand why my original call, with na.strings = "NA", gives:

is.na(my.data$my.method[1])
#[1] FALSE

Thank you for any explanation.

Mark Miller
  • In your first example, `my.data$my.method[1]` is not `NA` but an empty string `""`, which is not the same as `NA` in R. In the second example, the `read.csv` argument `na.strings = c("", "NA")` causes the empty strings `""` in your file to be interpreted as `NA` values. – Matt Jewett Jun 13 '17 at 19:43
  • Thank you. It seems confusing to me that an empty string is not the same thing as a missing observation. Nevertheless, I appreciate the explanation. – Mark Miller Jun 13 '17 at 19:46
  • It may be helpful to think about `"" == ""` versus `NA == NA`. Also, `?NA` mentions `NA_character_`; compare `NA_character_ == NA_character_` or `NA_character_ == ""`. – lmo Jun 13 '17 at 20:00
  • Does it have to be a base R (i.e. `read.csv()`) solution? `read_csv()` from the `readr` package takes an `na = ` argument which allows you to specify what missing values look like in your data. – Phil Jun 13 '17 at 20:51
  • No, it does not have to be a base `R` solution. Although, I usually prefer base `R`. – Mark Miller Jun 13 '17 at 20:55
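
The points raised in the comments above can be checked directly in base R. This is a minimal sketch, assuming a hypothetical character vector `my.method` that stands in for the column as it is read with `na.strings = "NA"`; it is an illustration rather than the method from the linked post.

```r
# An empty string is an ordinary, known value; NA marks an unknown value.
"" == ""
#[1] TRUE
NA == NA
#[1] NA
NA_character_ == ""
#[1] NA
is.na("")
#[1] FALSE
is.na(NA_character_)
#[1] TRUE

# Hypothetical stand-in for my.data$my.method after the original read.csv call.
my.method <- c("", "", "BBB", "BBB", "AAA", "AAA", "AAA")

# Converting the empty strings after reading, instead of via na.strings.
my.method[my.method == ""] <- NA
is.na(my.method)
#[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```

Setting `na.strings = c("", "NA")` at read time avoids the extra step, while the replacement shown here may be handy when the data frame already exists.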
