2

The first part of this assignment is to import data from an external http site; the data contains eight variables with 1,339 observations. Four of those variables (age, height, weight, igf1) contain NA values within them (NOTE: the other variables may also have NA values, but I'm not concerned with them). I need to eliminate the NA values in those four variables: this is where I'm struggling.

Here is what I have so far:

#imports dataset from internet
importData <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", sep=',', header=T)
#inspects the data:
str(importData)

Basically, I want to remove ALL NA values in age, height, weight, and igf1. I'll know I'm successful when I have 858 observations remaining.

Three of the variables (height, weight, igf1) contain FACTOR type information. One of the variables (age) contains numeric information. I have been unable to successfully implement complete.cases and/or na.omit across them: those functions only seem to work on $age, where they eliminate the five NA values (but don't touch the other variables)

I need help cutting out the NA values in the remaining variables. Again, when I'm done, I should have 858 observations.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
thebonafortuna
  • 103
  • 2
  • 2
  • 8
  • Welcome to SO! You should try to provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). As to your question, when you say "remove", do you mean: (a) removing the rows containing `NA` in all four of them? (b) removing the rows containing `NA` in any of them? or (c) changing the `NA` to another value (imputing)? – Hugh Apr 03 '14 at 01:27
  • I have occasionally somehow managed to convert NA's into a character string "NA" without realising, causing all sorts of downstream problems. Check that you haven't managed the same. – JeremyS Apr 03 '14 at 01:27
  • Thank you, Hugh. To your question, I mean (A) removing rows containing NA across all four variables. I want to remove the rows entirely, as opposed to replacing the values. – thebonafortuna Apr 03 '14 at 01:29
  • Can we see `str(data)`? (`complete.cases` *shouldn't* leave `NA`s in any of the columns, which isn't quite what you want, but it seems you have other problems.) – Hugh Apr 03 '14 at 01:35
  • Thanks Hugh and Richard. Please see edits above. – thebonafortuna Apr 03 '14 at 01:55
  • You have the answer in your call to `str`. All values are between quotes, and R interprets them as factors, not as numbers. `NA` is just a factor level, not a real `NA` – Julián Urbano Apr 03 '14 at 02:02

4 Answers4

7

The problem is that your data file encloses all values in quotes, like this:

"age", "height", "menarche", "sex", "igf1", "tanner", "testvol", "weight"
"NA", "NA", "NA", "NA", "90.0", "NA", "NA", "NA"
"NA", "NA", "NA", "NA", "88.0", "NA", "NA", "NA"

When R reads that, it takes all values as strings, which by default represent as factors. The NAs are just taken as the label for one level of those factors. You can see this with str:

> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : Factor w/ 600 levels " 110.8"," 111.5",..: 600 600 600 600 600 600 600 600 600 600 ...
 $ menarche: Factor w/ 3 levels " 1"," 2"," NA": 3 3 3 3 3 3 3 3 3 3 ...
 $ sex     : Factor w/ 3 levels " 1.00"," 2.00",..: 3 3 3 3 3 1 1 1 1 1 ...
 $ igf1    : Factor w/ 501 levels " 100.0"," 101.0",..: 490 487 53 55 23 2 498 6 10 474 ...
 $ tanner  : Factor w/ 6 levels " 1"," 2"," 3",..: 6 6 6 6 6 1 1 1 1 1 ...
 $ testvol : Factor w/ 26 levels " 1"," 10"," 11",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ weight  : Factor w/ 518 levels " 14.1"," 17.9",..: 518 518 518 518 518 518 518 518 518 518 ...

So read your data without making factors, which reads all as numeric:

> importData <- read.csv("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv",
                         stringsAsFactors=F, na.strings=c(NA,"NA"," NA"))
> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  NA NA NA NA NA 1 1 1 1 1 ...
 $ igf1    : num  90 88 164 166 131 101 97 106 111 79 ...
 $ tanner  : int  NA NA NA NA NA 1 1 1 1 1 ...
 $ testvol : int  NA NA NA NA NA NA NA NA NA NA ...
 $ weight  : num  NA NA NA NA NA NA NA NA NA NA ...

and now remove NAs:

> data <- importData[complete.cases(importData[c("age","height","weight","igf1")]),]
> str(data)
'data.frame':   858 obs. of  8 variables:
 $ age     : num  6 6.08 6.26 6.4 6.42 6.43 6.61 6.63 6.7 6.72 ...
 $ height  : num  112 117 120 116 116 ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ igf1    : num  98 242 196 179 126 142 236 148 174 136 ...
 $ tanner  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ testvol : int  1 1 1 1 1 1 1 2 1 1 ...
 $ weight  : num  19.1 21.7 24.7 19.6 20.6 20.2 28 21.6 26.1 22.6 ...
Julián Urbano
  • 8,378
  • 1
  • 30
  • 52
6

Why not make things a lot less stressful and read the .txt version of the data instead? I found it in the data list on the same page as the .csv. It isn't spaced abnormally like the .csv file and there is no fiddling around with read.csv arguments (which I did for quite a while before finding the other file).

con <- "http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.txt"
dat <- read.table(con, header = TRUE)
sapply(dat, class)
#       age    height  menarche       sex      igf1    tanner   testvol    weight 
# "numeric" "numeric" "integer" "numeric" "numeric" "integer" "integer" "numeric" 
cols <- c("age", "height", "weight", "igf1")
resultSet <- dat[complete.cases(dat[cols]), ]
dim(resultSet)
# [1] 858   8
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
4

Read in your data using as.is=TRUE and na.strings = c(NA, "NA", " NA")). Your file has spaces between fields which are interpreted as part of the field.

data <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", 
        sep=",", header=TRUE, as.is=TRUE, na.strings=c(NA, "NA", " NA"))

If you want to remove rows where any of the selected columns are NA use complete.cases:

new.data <- data[complete.cases(data[ ,c("age", "height", "weight", "igf1")]), ]
nrow(new.data)
# [1] 858

If you want to remove only the rows where all of the selected columns are NA, use rowSums:

new.data <- data[rowSums(is.na(data[ ,c("age", "height", "weight", "igf1")])) < 4, ]
nrow(new.data)
# [1] 1339   # there aren't any such rows :)

This isolated the four columns, tests whether they are NA, and for each row calculates how many are NA, returning only the ones that have fewer than 4 NAs in each row.

Hugh
  • 15,521
  • 12
  • 57
  • 100
  • Thanks, Hugh. Implementing this, I'm still left with 1,339 observations - which is telling me it didn't remove the NA values? Please see example: > test1 <- importData[rowSums(is.na(importData[ , c("age", "height", "weight", "igf1")])) < 4, ] > str(test1) 'data.frame': 1339 obs. of 8 variables – thebonafortuna Apr 03 '14 at 02:08
  • Are you sure there are any rows to delete? I can't see any rows that fit the description. (There are only 5 rows which have `age` NA and they only have three `NA`s in the columns.) – Hugh Apr 03 '14 at 02:15
  • Thanks, Hugh. I just edited the everything above to hopefully make it more clear. – thebonafortuna Apr 03 '14 at 02:26
  • You actually wanted option (b), not option (a). – Hugh Apr 03 '14 at 02:28
  • Thank you, Hugh. I was more than a little unclear in the original post, sorry about that. Thanks again for the help! – thebonafortuna Apr 03 '14 at 02:33
0

If you are reading data from file, use the read.table option bellow:

read.table(...,quote="")

If you conditionally created NAs, like:

df[condition] <- NA

make sure you never use it quoted ("NA").

fred
  • 9,663
  • 3
  • 24
  • 34