New-ish to R, I have a question about data cleaning.

I have a column that records each car's drive type: four wheel, all wheel, 2 wheel, etc.

The problem is there is no standardization, so some rows have 4 WHEEL drive, 4wd, 4WD, Four - Wheel - Drive, etc

The first step is easy: uppercase everything. The step I'm having trouble with is changing each value to a standard, like 4WD, without having to recode each unique value by hand.

Something like For Each value in column, if value LIKE/CONTAINS "FOUR" change to "4WD".

I've researched recode, stringdist and mutate, but I can't find a fit. Typing it out, it sounds like I need a loop, but I'm not sure of the exact syntax.

If the solution could work with the tidyverse that would be great!

tshurtz

2 Answers

Welcome to StackOverflow! I've answered your question, but in the future, please include a small sample of your data so it's easier for us to solve your problem. Food for thought: How to make a reproducible example

library(dplyr)

# Since you haven't provided a data sample, I'm going to assume your data frame is named "DF" and your column is named "Drive"

# Set everything to lowercase to pare down uniqueness
DF <- mutate(DF, Drive = tolower(Drive))

# You'll need one line like this for each replacement, of the following form:
#     <column_name> = replace(<column_name>, <condition>, <new value>)
DF <- mutate(DF, Drive = replace(Drive, Drive == "4 wheel drive", "4WD"))
Punintended
  • In this column there are ~45 unique values so doing that for each is tiresome and with how the file is created, there could be different variations that aren't captured. Want to stay away from hard-coding it, that's why I wanted it to have LIKE/CONTAINS functionality. – tshurtz Feb 05 '18 at 18:48
You can use ifelse and grepl. Change the first argument of grepl to something that will match all your desired cases. Below searches for strings containing "4" or "FOUR"

df$cleaned_col <- ifelse(grepl('4|four', df$colname_here, ignore.case = TRUE), '4WD', df$colname_here)
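
To make the one-liner above concrete, here is a minimal sketch on made-up data. The data frame `df`, the column name `drive`, and the sample values are assumptions for illustration, not taken from the asker's file.

```r
# Hypothetical sample data; the column name `drive` is an assumption.
df <- data.frame(
  drive = c("4 WHEEL drive", "4wd", "Four - Wheel - Drive", "AWD", "2 wheel"),
  stringsAsFactors = FALSE
)

# Any value containing "4" or "four" (case-insensitive) becomes "4WD";
# every other value is kept exactly as it was.
df$cleaned <- ifelse(grepl("4|four", df$drive, ignore.case = TRUE), "4WD", df$drive)

print(df$cleaned)
```

Values that do not match the pattern pass through unchanged, so it is worth checking `unique(df$cleaned)` afterwards to see which variants still need a rule.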

If you want to do multiple comparisons you may want to use dplyr::case_when with %like% from data.table

library(dplyr)
library(data.table)
df %>% mutate(cleaned = case_when(colname %like% 'a|b' ~ "there's an a or b in there",
                                  colname %like% 'c' ~ "has a c in it",
                                  TRUE ~ "no a or b or c"))
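
Applied to the drive-type problem, the same idea might look like the sketch below. It swaps `%like%` for base R's `grepl` with `ignore.case = TRUE` so that only dplyr is needed and case does not matter. The column name `drive`, the sample values, and the patterns are assumptions; adjust the patterns to whatever variants actually appear in your data.

```r
library(dplyr)

# Hypothetical sample data; the column name `drive` is an assumption.
df <- data.frame(
  drive = c("4 WHEEL drive", "Four - Wheel - Drive", "all wheel",
            "AWD", "2 wheel", "front"),
  stringsAsFactors = FALSE
)

cleaned <- df %>%
  mutate(
    drive_std = case_when(
      # Conditions are checked top to bottom; the first match wins.
      grepl("4|four", drive, ignore.case = TRUE)  ~ "4WD",
      grepl("all|awd", drive, ignore.case = TRUE) ~ "AWD",
      grepl("2|two", drive, ignore.case = TRUE)   ~ "2WD",
      # Fallback: keep unmatched values, uppercased, for later review.
      TRUE ~ toupper(drive)
    )
  )

print(cleaned$drive_std)
```

Because `case_when` stops at the first matching condition, put the more specific patterns first. Very short patterns (a single letter such as "A") will match far more values than intended.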
IceCreamToucan
  • Does the `dplyr` part add anything? Your main functions are `ifelse` and `grepl`, that's all you need. Also, you can add `all` to patterns. – pogibas Feb 05 '18 at 18:45
  • @RobJensen If I wanted to do multiple arguments, so '4 | FOUR' = "4WD", 'All Wheel | A' = "AWD" How would I do that? – tshurtz Feb 05 '18 at 18:51
  • @RobJensen that's perfect. Exactly what I needed. I had tried case_when but I didn't have %like%. Thanks so much! Many cleaning to do :-) – tshurtz Feb 05 '18 at 19:06