Okay, so I have a bit of a predicament, and I know it's got to have a solution. I have a data sheet with 13 columns, but we are only concerned with two (Fare and pClass). There are 1309 rows, and 1308 of them have a value for Fare. I want to fill in the missing value with the average fare for that passenger's class (the pClass). What I would like is something that tells R to find a row where Fare is NA, read the value of pClass (1, 2, or 3), compute the average Fare for that class, and then replace the missing Fare with that average.
So, to summarize the mission for whoever is valiant and kind enough to help me out: I want to find a missing value, figure out which class it belongs to, average the fares of that class only, and replace the missing value with that average.
I'd rather do it this way than just look up the one missing row by hand, because the same approach should work when I have multiple missing values, replacing each with the correct average regardless of which column decides the group.
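To make the idea concrete, here is a minimal sketch on a made-up data frame. The helper name fill_na_by_group and the toy numbers are hypothetical and not from my data; it assumes base R's ave() is a reasonable way to get a per-group statistic lined up with each row.

```r
# Hypothetical helper: replace each NA in `values` with a statistic
# (mean by default) computed over the non-missing values of the same group.
fill_na_by_group <- function(values, groups, stat = mean) {
  # ave() returns a vector as long as `values`, where every position
  # holds the statistic of the group that position belongs to.
  group_stat <- ave(values, groups, FUN = function(v) stat(v, na.rm = TRUE))
  # Keep the original value where present, use the group statistic otherwise.
  ifelse(is.na(values), group_stat, values)
}

# Toy data standing in for the real sheet (made-up numbers).
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3),
  Fare   = c(80, 100, NA, 20, 30, 8, NA, 10)
)

toy$Fare <- fill_na_by_group(toy$Fare, toy$Pclass)
print(toy)
```

Because nothing in the helper is hard-coded to one row or one column, it would handle several missing values at once, and passing stat = median would switch from the mean to the median.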
Thank you for your time,
-Dylan
Okay, so since the original question was WAY too specific to answer, here's the new plan, everyone. Make 3 variables corresponding to the three different pClasses (1, 2, and 3), each holding that class's average fare (I'm going to call them pClassAVG.x, where x = 1, 2, or 3). Then have R find the missing values in Fare and replace each one with the average for the corresponding pClass. R's thought process should look like this: "Okay, missing value. What's the pClass? Okay, it is 2, so we should replace the missing value with pClassAVG.2."
Last time I got -1 for not including my code, so here it is:
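A rough sketch of that thought process, again on made-up numbers: instead of three separate variables, this assumes a single named vector (called class_avg here, a hypothetical name) built with tapply(), so the class of each missing row can be used to look up its average.

```r
# Toy data (made-up numbers): one missing Fare in class 2.
toy <- data.frame(
  Pclass = c(1, 2, 2, 3, 3, 2),
  Fare   = c(90, 20, 30, 8, 10, NA)
)

# One average per class; the result is named "1", "2", "3".
class_avg <- tapply(toy$Fare, toy$Pclass, mean, na.rm = TRUE)

# "Okay, missing value. What's the pClass?" -> look up that class's average.
missing <- is.na(toy$Fare)
toy$Fare[missing] <- as.numeric(class_avg[as.character(toy$Pclass[missing])])

print(class_avg)
print(toy)
```

The lookup goes through as.character() so that class 2 selects the element named "2" rather than the second position, which matters if the class labels are not simply 1, 2, 3 in order.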
setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# Line 1 tells R where to look for the data. Lines 2 and 3 read the two CSV files in so we can manipulate them.
# stringsAsFactors = FALSE keeps text columns as plain character strings instead of factors, which avoids
# mismatched factor levels when we clean the two data sheets together.
# header = TRUE tells R that the first line holds the column names, so it is read as titles and not as data.
#currently reads incorrectly
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#makes a new column to tell us if it is the train set or test set
titanic.test$Survived <- NA
#makes a new column and fills it with NA to make the columns line up and have the same names
titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes
age.median <- median(titanic.full$Age, na.rm = TRUE)
#creates a variable called age.median and assigns it the median of the Age column, excluding the missing values (without
#na.rm = TRUE, median() would just return NA because the column contains NA values)
#computing the median in code is better for data that can change, for example real-time data that gets updated over the course of
#the day: rerunning the script recomputes the median from the current values, so the fill-in value never goes stale.
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
#table(is.na(titanic.full$Age)) counts how many values in the Age column of titanic.full are missing (TRUE) and how many are not (FALSE)
pClassAVG.1 <- median(titanic.full$Fare[titanic.full$Pclass == 1], na.rm = TRUE)
pClassAVG.2 <- median(titanic.full$Fare[titanic.full$Pclass == 2], na.rm = TRUE)
The last two lines are my attempt at creating the aforementioned pClassAVG.1 and pClassAVG.2.
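One thing worth checking, as far as I can tell: a call shaped like median(x, na.rm = TRUE, condition) does not filter x by the condition. In recent versions of R (3.4.0 and later, where median() accepts extra arguments through ...), the third argument appears to be silently ignored, so the result is the median of the whole column. The sketch below uses made-up vectors (fares and pclass are hypothetical stand-ins) to compare that call with subsetting first.

```r
# Made-up stand-ins for the Fare and Pclass columns.
fares  <- c(10, 20, 30, 100, 200, NA)
pclass <- c(3, 3, 3, 1, 1, 1)

# Condition passed as an extra argument: it is not used for filtering.
ignored <- median(fares, na.rm = TRUE, pclass == 1)

# Median of everything, for comparison.
overall <- median(fares, na.rm = TRUE)

# Subset the fares to class 1 first, then take the median.
subsetted <- median(fares[pclass == 1], na.rm = TRUE)

print(c(ignored = ignored, overall = overall, subsetted = subsetted))
```

On a recent R the first two numbers should match each other, while the subsetted version reflects class 1 only.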