I have this CSV dataset and I need to write a function to clean it, but my code is still not working and I am running out of ideas.
Here is the dataset on Google Drive.
Here is what I need to do:
- Correcting possible typos
- Removing irrelevant data (only houses in Auckland and Wellington are considered)
- Removing outliers, e.g. negative area, negative power consumptions, very high areas, very high power consumptions
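To make the goal concrete, the three steps above could be sketched roughly as follows. This is only an illustration of the result I am after, not working code for the real file: the helper name `clean_power`, the upper limits `max_area` and `max_power`, and the small example data frame are all assumptions made up for this sketch.

```r
# Hypothetical example data; the real file has more rows and columns.
power_example <- data.frame(
  Area     = c(120, -50, 95, 110, 5000, 130),
  City     = c("Auckland", "Wellington", "Ackland",
               "Hamilton", "Wellington", "Wellington"),
  P.Winter = c(1500, 1400, 1300, 1200, 1600, 1450),
  P.Summer = c(900, 800, 700, 750, 850, -20),
  stringsAsFactors = FALSE
)

# clean_power() is a hypothetical helper; the default limits are assumed values.
clean_power <- function(df, max_area = 1000, max_power = 10000) {
  # 1. Correct typos: work on plain strings, trim whitespace, map the misspelling.
  df$City <- trimws(as.character(df$City))
  df$City[df$City == "Ackland"] <- "Auckland"

  # 2. Keep only houses in Auckland and Wellington.
  df <- df[df$City %in% c("Auckland", "Wellington"), ]

  # 3. Drop outliers: negative or very large areas and power consumptions.
  ok <- df$Area >= 0 & df$Area <= max_area &
    df$P.Winter >= 0 & df$P.Winter <= max_power &
    df$P.Summer >= 0 & df$P.Summer <= max_power

  # which() also drops rows where the check is NA.
  df[which(ok), ]
}

cleaned <- clean_power(power_example)
print(cleaned)
```

In this made-up example only the two rows that end up as valid Auckland houses remain; sensible limits for the real data would have to be chosen after looking at its distribution.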
This is the code I have written so far:
# Installing and loading packages
install.packages("lubridate")
library(lubridate)
# Reading data set
power <- read.csv("data set 6.csv", na.strings="")
# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)
#Data Cleaning
levels(power$City) <- c(levels(power$City), "Auckland")
power$City[power$City == "Ackland"] <- "Auckland"
#Removing irrelevant data (only houses in Auckland and Wellington are considered)
power <- power[-c(496, 499), ]
After I run this code, the misspelled value ("Ackland") does not change to "Auckland" as I expected.
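One check that might help narrow this down is printing the distinct city values with quotes, so that stray spaces or capitalisation differences become visible. The sketch below uses a made-up stand-in vector rather than the real column, and it is only an assumption that whitespace or case is involved at all.

```r
# Hypothetical stand-in for power$City; these values are invented for illustration.
city <- factor(c("Auckland", "Ackland ", "Wellington", "ackland"))

# Print each distinct value in quotes so trailing spaces and case are visible.
print(encodeString(levels(city), quote = '"'))

# Count exact matches for the string used in the replacement.
n_exact <- sum(city == "Ackland")
print(n_exact)
```

With the stand-in values above, none of the variants equals "Ackland" exactly, so an exact comparison would leave all of them unchanged.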
The highlighted row in this image is supposed to change to Auckland: