Data cleaning and misspelled words in a table

Question

I have a CSV dataset and need to write a function that cleans the data, but my code is still not working and I am running out of ideas.

Here is the dataset on Google Drive.

Here is what I need to do:

Correcting possible typos
Removing irrelevant data (only houses in Auckland and Wellington are considered)
Removing outliers, e.g. negative area, negative power consumptions, very high areas, very high power consumptions

So far this is the code I have done:

# Install and load packages
install.packages("lubridate")
library(lubridate)

# Reading data set
power <- read.csv("data set 6.csv", na.strings="")

# SUBSETTING
Area <- as.numeric(power$Area)
City <- as.character(power$City)
P.Winter <- as.numeric(power$P.Winter)
P.Summer <- as.numeric(power$P.Summer)

#Data Cleaning
levels(power$City) <- c(levels(power$City), "Auckland")
power$City[power$City == "Ackland"] <- "Auckland"

#Removing irrelevant data (only houses in Auckland and Wellington are considered)
power$City <- power$City[-c(496,499), ]

After I run this code, the misspelled word ("Ackland") does not change to "Auckland" as I expected. The highlighted row shown in this image is supposed to change to Auckland:

Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example. — Marcus Müller, Oct 15 '17 at 21:19
@MarcusMüller I hope the image I uploaded could give an idea of what I expect — Nelson, Oct 16 '17 at 01:21
`factor`s can be confusing. You probably don't need them. Do `power$City = as.character(power$City)` and things should work more like you expect. Alternatively, add the `stringsAsFactors=FALSE` argument to `read.csv`. — kdauria, Nov 11 '17 at 16:04

Josephine Rueckl · Accepted Answer · 2017-11-10T03:24:31.047

To collapse the factor levels 'Ackland' and 'Auckland' (assuming you want power$City to remain a factor):

One method is to pass levels() a named list in which each name is the correct label for a desired level (here, the correct city names in your data set) and each element lists the existing labels to merge into it. See "Cleaning up factor levels (collapsing multiple levels/labels)" for a general example.

Watch out, though, for the trailing space after the "Ackland" and "Auckland" strings in your data set. Because of it, the comparison power$City == "Ackland" in your code matches nothing:

    # First view the classes to confirm power$City is a factor
    > sapply(power, class)     # is.factor(power$City) will work too
         Area      City  P.Winter  P.Summer 
    "numeric"  "factor" "numeric" "numeric" 

    # Notice the spaces after "Ackland " and "Auckland "
    > levels(power$City)
    [1] "Ackland "   "Auckland "  "Sydney"     "Wellington"

Passing a named list to levels() works once you account for the spaces:

    levels(power$City) <-  list(Auckland = c("Ackland ", "Auckland "), Sydney = c("Sydney"), Wellington = c("Wellington"))

    # Now only three factor levels (notice this also took care of the extra spaces)
      > levels(power$City)
     [1] "Auckland"   "Sydney"     "Wellington"

You now have three levels instead of four, and the level labels no longer contain stray spaces.
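
If many levels carry stray whitespace, listing each padded label by hand gets tedious. As an alternative sketch (not part of the original approach), the labels could be trimmed with trimws() first and the typo corrected afterwards. The `city` vector below is a made-up stand-in for power$City:

```r
# Hypothetical stand-in for power$City, with trailing spaces in two levels
city <- factor(c("Ackland ", "Auckland ", "Sydney", "Wellington", "Auckland "))

# Strip leading/trailing whitespace from the level labels
levels(city) <- trimws(levels(city))

# Correct the typo; assigning an already existing label merges the two levels
levels(city)[levels(city) == "Ackland"] <- "Auckland"

levels(city)
table(city)
```

This relies on the fact that assigning duplicate labels through levels<- merges the corresponding levels of a factor.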

Subset to include only the relevant cities:

    # %in% tests each row against the whole set of allowed cities
    subpower <- power[power$City %in% c("Auckland", "Wellington"), ]
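
A note on the comparison operator when filtering on several values: `==` against a length-2 vector recycles that vector along the column, so each row is compared with only one of the two names, whereas `%in%` tests membership in the whole set. A small sketch with made-up data shows the difference:

```r
# Hypothetical city column; the last "Auckland" sits at an even position
city <- factor(c("Auckland", "Wellington", "Auckland",
                 "Wellington", "Sydney", "Auckland"))

# `==` recycles c("Auckland", "Wellington") element by element,
# so row 6 is compared with "Wellington" only
eq_match <- city == c("Auckland", "Wellington")

# `%in%` checks every row against both names
in_match <- city %in% c("Auckland", "Wellington")

sum(eq_match)  # number of rows kept by ==
sum(in_match)  # number of rows kept by %in%
```

With `==`, rows can be dropped silently depending on their position, which is why `%in%` is the safer choice here.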

You could subset in the same way to exclude negative values, extreme values, and so on.
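
As a rough sketch of that last step, the city filter and the outlier filters could be combined into one logical condition. The sample data frame and the upper limits (`max_area`, `max_power`) are assumptions for illustration only; sensible thresholds depend on the real data:

```r
# Hypothetical sample in the shape of the question's data
power <- data.frame(
  Area     = c(120, -50, 95, 100000, 140, 110),
  City     = factor(c("Auckland", "Wellington", "Sydney",
                      "Auckland", "Wellington", "Wellington")),
  P.Winter = c(1500, 1200, 1300, 1400, -10, 1600),
  P.Summer = c(900, 800, 850, 950, 700, 1000)
)

# Assumed upper limits; choose values that make sense for the real data
max_area  <- 1000
max_power <- 10000

keep <- power$City %in% c("Auckland", "Wellington") &
  power$Area > 0 & power$Area <= max_area &
  power$P.Winter > 0 & power$P.Winter <= max_power &
  power$P.Summer > 0 & power$P.Summer <= max_power

# which() skips rows where the condition is NA;
# droplevels() removes the now-unused "Sydney" level
clean <- droplevels(power[which(keep), ])
clean
```

Fixed thresholds are only one option; quantile-based limits would be another way to define "very high" values.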

Note: My only real contribution here is catching the extra spaces. When tackling similar problems myself, I found Aaron's answer very helpful. Hope this helps!

Yes, the extra spaces were one of the root causes of the issue — Nelson, Nov 11 '17 at 16:16
