I have this CSV dataset and I need to write a function to clean it, but my code is still not working and I am running out of ideas.

Here is the dataset on Google Drive.

Here is what I need to do:

  • Correcting possible typos
  • Removing irrelevant data (only houses in Auckland and Wellington are considered)
  • Removing outliers, e.g. negative area, negative power consumptions, very high areas, very high power consumptions

This is the code I have written so far:

# Installing and loading packages
install.packages("lubridate")
library(lubridate)

# Reading data set
power <- read.csv("data set 6.csv", na.strings="")

# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)

#Data Cleaning
levels(power$City) <- c(levels(power$City), "Auckland")
power$City[power$City == "Ackland"] <- "Auckland"

#Removing irrelevant data (only houses in Auckland and Wellington are considered)
power$City <- power$City[-c(496,499), ]

After I run this code, the misspelled value ("Ackland") does not change to "Auckland" as I expected. The highlighted row in this image is supposed to change to Auckland: [image]

Nelson
  • Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example. – Marcus Müller Oct 15 '17 at 21:19
  • See function `?droplevels`. – Rui Barradas Oct 15 '17 at 21:41
  • @MarcusMüller I hope the image I uploaded could give an idea of what I expect – Nelson Oct 16 '17 at 01:21
  • `factor`s can be confusing. You probably don't need them. Do `power$City = as.character(power$City)` and things should work more like you expect. Alternatively, add the `stringsAsFactors = FALSE` argument to `read.csv`. – kdauria Nov 11 '17 at 16:04

1 Answer

To address your issue with collapsing the factor levels 'Ackland' and 'Auckland' (assuming you want power$City to be, or remain, a factor):

One method is to assign a named list to levels(), where each name is the correct label of a desired level (in your case, the correct city names in your data set) and each element lists the existing levels to merge into it. See "Cleaning up factor levels (collapsing multiple levels/labels)" for a general example.

However, as a heads-up, watch for the trailing space after the Ackland and Auckland character strings in your data set:

    # first view classes to confirm power$City is a factor
     > sapply(power, class)     # --> or is.factor(power$City) will work too
        Area      City  P.Winter  P.Summer 
    "numeric"  "factor" "numeric" "numeric" 

    # Notice spaces behind "Ackland " and "Auckland "
     > levels(power$City)
    [1] "Ackland "   "Auckland "  "Sydney"     "Wellington"
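
The trailing space is also why the comparison in the question never matches: the stored value is `"Ackland "`, not `"Ackland"`. As a minimal sketch on a made-up factor (`city` is hypothetical, not the real data set), base R's `trimws()` can strip the whitespace from the level labels before the typo is fixed:

```r
# Hypothetical factor mimicking the levels shown above
city <- factor(c("Ackland ", "Auckland ", "Sydney", "Wellington"))

# The question's comparison finds nothing because of the trailing space
matched_before <- any(city == "Ackland")
matched_before

# Strip leading/trailing whitespace from the level labels
levels(city) <- trimws(levels(city))

# Renaming a level to an already existing label merges the two levels
levels(city)[levels(city) == "Ackland"] <- "Auckland"
levels(city)
```

This is an alternative to the named-list approach, not a replacement for it; either way the typo and the stray spaces end up in a single "Auckland" level.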

Passing a named list to levels() works once you account for the spaces:

    levels(power$City) <-  list(Auckland = c("Ackland ", "Auckland "), Sydney = c("Sydney"), Wellington = c("Wellington"))

    # Now only three factor levels (notice this also took care of the extra spaces)
      > levels(power$City)
     [1] "Auckland"   "Sydney"     "Wellington"

You now have three levels instead of four, and the trailing spaces in the level labels are gone as well.

Subset to include only relevant cities

       subpower <- power[power$City %in% c("Auckland", "Wellington"), ]
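
When filtering on several cities, keep in mind that `==` against a two-element vector compares position by position with recycling, whereas `%in%` tests membership. Here is a small sketch on made-up data (`toy` is hypothetical), which also uses `droplevels()`, as suggested in the comments, to discard the unused `Sydney` level afterwards:

```r
# Hypothetical toy data, not the real data set
toy <- data.frame(
  City = factor(c("Wellington", "Auckland", "Sydney", "Auckland")),
  Area = c(150, 120, 200, 95)
)

# `==` recycles c("Auckland", "Wellington") along the column,
# so rows are compared position by position and matches are missed
by_equals <- toy[which(toy$City == c("Auckland", "Wellington")), ]

# `%in%` keeps every row whose city is one of the two
by_in <- toy[toy$City %in% c("Auckland", "Wellington"), ]

# Remove the now-unused "Sydney" level from the subset
by_in$City <- droplevels(by_in$City)

nrow(by_equals)
nrow(by_in)
levels(by_in$City)
```

In this toy example the `==` version happens to keep no rows at all, while `%in%` keeps the three Auckland and Wellington rows.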

You could also subset to exclude negative values, extreme values, etc...
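
One possible way to do that is a logical filter on the numeric columns. This is only a sketch: the column names come from the question, but the data frame `toy` and the upper bounds `max_area` and `max_power` are assumptions made for illustration and should be chosen after inspecting the real data (for example with `summary()` or a boxplot).

```r
# Hypothetical data using the column names from the question
toy <- data.frame(
  Area     = c(120, -5, 95, 10000, 150, 80),
  P.Winter = c(2100, 1800, -300, 2500, 2300, 1900),
  P.Summer = c(1500, 1300, 1200, 1700, 99999, 1400)
)

# Assumed upper bounds; pick real ones after inspecting the data
max_area  <- 1000
max_power <- 10000

# Keep rows with a positive, plausible area and non-negative, plausible consumption
keep <- toy$Area > 0 & toy$Area <= max_area &
  toy$P.Winter >= 0 & toy$P.Winter <= max_power &
  toy$P.Summer >= 0 & toy$P.Summer <= max_power

clean <- toy[keep, ]
nrow(clean)
```

If the real columns contain missing values, `keep` will contain `NA` as well, so those rows would need to be handled separately.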

Note: My only real contribution here is catching the extra spaces. When tackling similar problems myself, I found Aaron's answer very helpful. Hope this helps!