I'm taking an Advanced Business Analysis class for school and we're learning to program in R using RStudio.

The professor shared a hint to help us solve a problem, but I'm unable to get it to work.

I'm trying to replace every NA height value with the mean height for that row's gender.

Here's what the professor shared as a solution to the problem, but it doesn't work. Nothing gets updated in the data table:

data$height[is.na(data$height) && data$gender == "female"] = data$height[data$gender=="female"]

I tried this:

data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"])

and this:

data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)

But got this error:

In mean.default(data$height[data$gender == "female"]) :  argument is not numeric or logical: returning NA

I calculated the mean height of each gender and tried it this way, but that didn’t work either. In all scenarios, the height still displays “NA”.

femaleMeanHeight = mean(data$height[data$gender=="female"], na.rm = TRUE)
data$height[is.na(data$height) && data$gender == "female"] = femaleMeanHeight

I don't know where else to go. Any help is greatly appreciated.

LouC
    It's hard to know exactly how to help without a [reproducible example](https://stackoverflow.com/q/5963269/5325862) that shows exactly what you're trying to do and what's going wrong – camille Sep 09 '20 at 14:35

1 Answer

There are two problems with your code. The first is in data$height[is.na(data$height) && data$gender == "female"] and the second is in mean(data$height[data$gender=="female"]).

Let's start with the second problem, which you have already solved. Calculating a mean over values that include NA returns NA, so you set na.rm = TRUE to have the NAs ignored. (Without it you would be replacing NA with NA, which changes nothing.)
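A minimal sketch of the na.rm behaviour, using made-up vectors (heights and heights_chr are hypothetical, not your data). It also shows one possible cause of the "argument is not numeric or logical" warning quoted in the question: mean() gives that warning when its input is not numeric, for example when the column was read in as text. Whether that applies to your data is an assumption, since the data itself isn't shown.

```r
# Hypothetical numeric vector with one missing value
heights <- c(160, NA, 170)

mean(heights)                # NA, because one value is missing
mean(heights, na.rm = TRUE)  # 165, the NA is dropped before averaging

# Hypothetical case: the same values stored as text
heights_chr <- c("160", NA, "170")

# mean() on a character vector returns NA and warns
# "argument is not numeric or logical: returning NA"
result_chr <- suppressWarnings(mean(heights_chr))

# Converting to numeric first makes the mean work again
result_num <- mean(as.numeric(heights_chr), na.rm = TRUE)  # 165
```

If str(data) shows height as chr, converting the column with as.numeric() before taking the mean would be the first thing to try. For a factor column, convert through as.character() first, because as.numeric() on a factor returns the internal level codes.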

The first problem is the && part. & and && are different operators. Use & instead of && and your code should run:

data$height[is.na(data$height) & data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)

As mentioned, && and & have different meanings.

& does exactly what you want. It tests, for every row, whether both of your conditions are true (is height NA and is gender female?). The result is a logical vector with one value per row, for example TRUE, FALSE, TRUE, FALSE (the first and third rows meet the conditions). The mean height then overwrites the height only in the rows marked TRUE, which is what you want.
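To make the row-by-row behaviour concrete, here is a small sketch on a made-up data frame (the values and the helper name is_missing_female are invented for illustration):

```r
# Hypothetical data: two missing heights, one female and one male
data <- data.frame(
  gender = c("female", "male", "female", "female", "male"),
  height = c(160, 180, NA, 170, NA)
)

# & compares element by element and returns one logical per row
is_missing_female <- is.na(data$height) & data$gender == "female"
is_missing_female
# FALSE FALSE  TRUE FALSE FALSE

# Only the rows marked TRUE are overwritten with the female mean
data$height[is_missing_female] <- mean(data$height[data$gender == "female"],
                                       na.rm = TRUE)
data$height
# 160 180 165 170  NA
```

The missing male height in row 5 is left alone, because the condition is FALSE for that row.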

&& only tests the first row, so you get a single TRUE or FALSE. (This describes R before version 4.3.0; from 4.3.0 on, && raises an error when it is given vectors longer than one element.) If your first row has NA in height and female in gender, you get TRUE, and the whole column is overwritten with the mean (data$height[TRUE] selects every element of height). If your first row is not female, or its height has a value, the result is FALSE, so no height is overwritten with the mean.

So the reason nothing changed is probably that your first row didn't match your conditions, so the result was FALSE. Assigning the mean to data$height[FALSE] replaces the height in no rows at all.
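The effect of indexing with a single TRUE or FALSE can be seen directly on a made-up vector. This sketch uses the scalar values themselves rather than &&, so it runs the same way on old and new R versions:

```r
# Hypothetical height column
height <- c(160, NA, 170)

# A single TRUE is recycled to every position: all elements are overwritten
all_rows <- height
all_rows[TRUE] <- 165
all_rows
# 165 165 165

# A single FALSE selects nothing: the vector is left unchanged
no_rows <- height
no_rows[FALSE] <- 165
no_rows
# 160  NA 170
```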

tamtam