Reading a value based off a different value in r

Question

Okay, so I have a bit of a predicament, and I know it's got to have a solution. I have a data sheet with 13 columns, but only two matter here (Fare and pClass). There are 1309 rows, and 1308 of them have a value for Fare. I want to fill in the missing value using the average fare of that passenger's class (pClass). What I'd like is something that tells R to find a row where Fare is NA, read the value in pClass (1, 2, or 3), compute the average Fare for that class, and replace the missing Fare with that average.

So, to summarize the mission for whoever is valiant and kind enough to help me out: I want to find a missing value, figure out which class it belongs to, average the fares of that class specifically, and replace the missing value with that average.

I'd rather do this than just look up the one missing row by hand, because the same approach will still work when I have multiple missing values that need to be replaced with the correct average, regardless of which column decides the group.

Thank you for your time,

-Dylan

Okay, since this is WAY too specific to answer the original question, here's the new plan, boys (and girls, and whatever else you want to be; I don't really care as long as you know what you're talking about). The new plan is to make three variables corresponding to the three pClasses (1, 2, and 3), each holding that class's average fare (I'm going to call them pClassAVG.x, where x = 1, 2, or 3). Then have R find the missing fares and replace each one with the average variable for the corresponding pClass. R's thought process should look like this: "Okay, missing value. What's the pClass? Okay, it's 2, so we should replace the missing value with pClassAVG.2."

Last time I got -1 for not including my code, so here it is:

setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# Line 1 tells R where to look for the data; lines 2 and 3 read the two CSV files into data frames.
# stringsAsFactors = FALSE keeps text columns as character strings rather than factors, which avoids
# mismatched factor levels, because we are going to clean the two data sheets together.
# header = TRUE tells R that the first line holds the column names, so it is not read as data.
#currently reads incorrectly

titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#makes a new column to tell us if it is the train set or test set

titanic.test$Survived <- NA
#makes a new column and fills it with NA to make the columns line up and have the same names

titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes

age.median <- median(titanic.full$Age, na.rm = TRUE)
# Creates a variable called age.median and assigns it the median of the Age column, excluding the
# missing values (without na.rm = TRUE the result would itself be NA).
# Computing the median from the data instead of hard-coding it is better for data that can change,
# e.g. data that is refreshed over the course of the day: the median stays correct after each update.

titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
# table(is.na(titanic.full$Age)) counts the missing values in the Age column: the TRUE count is the number of missing values

pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1 )
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2 )

The last two lines are my attempt at telling R to create the aforementioned pClassAVG.1 and pClassAVG.2.

[A reproducible example w/ your data would be helpful](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Tung, Oct 13 '17 at 23:13
Dylan, for your next questions, please take a look at this link that @thecatalyst just provided — Thai, Oct 13 '17 at 23:31

score 0 · Answer 1 · answered Oct 13 '17 at 23:13

# Build a small example data frame (base R data.frame(), so no extra package is needed)
df <- data.frame(Fare = c(10, 20, 30, 40, 50, 60, NA, 70, 80), pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

a <- df$pClass[which(is.na(df$Fare))] # find the pClass of the row where Fare is missing

df$Fare[which(is.na(df$Fare))] <- mean(df$Fare[df$pClass == a], na.rm = TRUE) # replace the missing Fare with the mean of the corresponding pClass

This works only if exactly one value of Fare is missing, because `a` has to be a single class.
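For the case with several missing fares, a rough sketch of one way to generalize this is base R's `ave()`, which computes a per-group value for every row. The toy data and the helper names `class_mean` and `missing` are made up for illustration; they are not from the question's Titanic files.

```r
# Toy data (made up for illustration): two missing fares, in different classes
df <- data.frame(Fare   = c(10, 20, 30, 40, 50, 60, NA, NA, 80),
                 pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# ave() applies FUN within each pClass group and returns a vector as long as Fare,
# so every row carries the mean fare of its own class
class_mean <- ave(df$Fare, df$pClass, FUN = function(x) mean(x, na.rm = TRUE))

# Overwrite only the rows where Fare is missing
missing <- is.na(df$Fare)
df$Fare[missing] <- class_mean[missing]
```

Because the class mean is computed per row first, it does not matter how many fares are missing or which classes they fall in.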

answered Oct 13 '17 at 23:13

Swapnil


what does Fare=c and pClass=c do? – Dylan Oct 31 '17 at 19:49
@Dylan c() creates a vector, which is then assigned to variables Fare and pClass. These variables are then used as columns to create df – Swapnil Nov 01 '17 at 22:53

score 0 · Answer 2 · answered Oct 13 '17 at 23:22

This should work; let me know if it doesn't.

There are probably more elegant solutions using apply, but this works as well.

# Create a data frame named df
fare <- c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2)
pclass <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(fare, pclass)

# Loop over each row
for (i in seq_len(nrow(df))) {

  # If the value of fare is missing ...
  if (is.na(df$fare[i])) {

    # ... replace it with the mean of the group defined by pclass
    df$fare[i] <- mean(df$fare[df$pclass == df$pclass[i]], na.rm = TRUE)

  }
}
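As a possible loop-free alternative, closer to the "pClassAVG" idea in the question, here is a minimal sketch that builds a lookup table of class means with `tapply()`. The names `class_avg`, `class1_median`, and `missing` are hypothetical, and the data is the same toy example as above. It also shows that, to get a statistic for a single class, the filter goes inside the square brackets; as far as I can tell, an extra argument passed to `median()` is not used as a filter.

```r
# Toy data (made up for illustration), same values as the loop example
df <- data.frame(fare   = c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2),
                 pclass = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3))

# A statistic for one class only: subset the column first, then summarize
class1_median <- median(df$fare[df$pclass == 1], na.rm = TRUE)

# One mean per class, named "1", "2", "3": the lookup table of class averages
class_avg <- tapply(df$fare, df$pclass, mean, na.rm = TRUE)

# For each missing row, look up the average of that row's class by name
missing <- is.na(df$fare)
df$fare[missing] <- class_avg[as.character(df$pclass[missing])]
```

Indexing the table by `as.character(...)` matches on the class label rather than on position, so it should keep working even if the class codes are not 1, 2, 3.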
