Trying to remove duplicate strings from a single observation to limit the number of factors

Question

I'm having difficulty reducing the number of factor levels in aggregated data. Long story short, I grouped a variety of car damage repair data to understand what maintenance has been done on each car. The issue is that each value now contains duplicate words whenever a certain aspect of the car has been worked on multiple times.

I'm trying to do this using str_replace and regular expressions. I have found a way to remove duplicates, but it only returns a vector rather than replacing each observation in my data frame.

Example data can be found below:

UNITNUMBER <- c(1,2,3,4,5,6,7,8,9,10)
MAINTENANCE_TYPE <- c("ELECTRIC BODY ELECTRIC", "ELECTRIC ACCESSORY BODY BODY", "ACCESSORY BODY ACCESSORY", "BODY ELECTRIC",
                      "ACCESSORY CHASSIS ELECTRIC CHASSIS", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS CHASSIS BODY",
                      "ELECTRIC ACCESSORY ELECTRIC BODY BODY CHASSIS", "BODY","ELECTRIC ELECTRIC")

df <- data.frame(UNITNUMBER, MAINTENANCE_TYPE)

I'd like the final output to be as follows in alphabetical order (if possible):

MAINTENANCE_TYPE <- c("BODY ELECTRIC", "ACCESSORY BODY ELECTRIC", "ACCESSORY BODY", "BODY ELECTRIC",
                      "ACCESSORY CHASSIS ELECTRIC", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS",
                      "ACCESSORY BODY CHASSIS ELECTRIC", "BODY","ELECTRIC")

Is this possible?

I've tried all sorts of str_replace functions using regex and have been hitting my head against the wall! Any help is appreciated.

Could you please include what has already failed? With recent sad [changes](https://meta.stackoverflow.com/questions/391250/upvotes-on-questions-will-now-be-worth-the-same-as-upvotes-on-answers), it is important that questions have more information. — NelsonGon, Nov 14 '19 at 03:44
@NelsonGon, I totally realize what I provided doesn't make for the best post, but I got so frustrated and deleted my attempts. I'm going to test what was provided as an answer and will update my question if it does not work. Thanks for the interest in answering my question. — Doug, Nov 14 '19 at 15:30

Ronak Shah · Accepted Answer · 2019-11-14T03:24:59.867

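You can use regex here with gsub to find any repeated words and remove them. This keeps only the last occurrence of each word, so the result is not sorted alphabetically and can contain double spaces where a word was removed.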

trimws(gsub("(\\b\\S+\\b)(?=.*\\1)", "", df$MAINTENANCE_TYPE, perl = TRUE))

# [1] "BODY ELECTRIC"  "ELECTRIC ACCESSORY  BODY"  "BODY ACCESSORY"                 
# [4] "BODY ELECTRIC" "ACCESSORY  ELECTRIC CHASSIS" "ACCESSORY BODY ELECTRIC"
# [7] "CHASSIS BODY"  "ACCESSORY ELECTRIC  BODY CHASSIS" "BODY"                        
#[10] "ELECTRIC"

Regex taken from here.
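The doubled spaces in the output above appear because the regex deletes the repeated word but leaves both of its neighbouring spaces. A minimal sketch of one way to tidy that up, assuming a second gsub pass that collapses whitespace runs is acceptable (the names x, deduped and cleaned are only for illustration):

```r
# Two of the example strings from the question
x <- c("ELECTRIC ACCESSORY BODY BODY", "ACCESSORY CHASSIS ELECTRIC CHASSIS")

# Step 1: drop every word that occurs again later in the same string
deduped <- gsub("(\\b\\S+\\b)(?=.*\\1)", "", x, perl = TRUE)

# Step 2: collapse runs of whitespace to one space, then trim the ends
cleaned <- trimws(gsub("\\s+", " ", deduped))

cleaned
```

Note that the back-reference in the lookahead is not anchored to a word boundary, so a word that also appears inside a longer, later word would be removed as well. With the maintenance categories in the question this does not come up.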

A standard approach would be to split each string into words, keep the unique words, sort them, and paste them back together.

sapply(strsplit(as.character(df$MAINTENANCE_TYPE), "\\s+"), function(x) 
             paste(sort(unique(x)), collapse = " "))

# [1] "BODY ELECTRIC"  "ACCESSORY BODY ELECTRIC"   "ACCESSORY BODY"         
# [4] "BODY ELECTRIC"  "ACCESSORY CHASSIS ELECTRIC" "ACCESSORY BODY ELECTRIC"
# [7] "BODY CHASSIS" "ACCESSORY BODY CHASSIS ELECTRIC" "BODY"                 
#[10] "ELECTRIC"
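Both calls above return a plain character vector, so to update the data frame itself the result has to be assigned back to the column. A minimal sketch on a three-row subset of the example data, assuming you want the column stored as a factor afterwards; dedupe_words is a hypothetical helper name, not an existing function:

```r
# Small subset of the question's data; stringsAsFactors = TRUE mimics
# the pre-R-4.0 default, where the column would be a factor
df <- data.frame(
  UNITNUMBER = 1:3,
  MAINTENANCE_TYPE = c("ELECTRIC BODY ELECTRIC",
                       "BODY ELECTRIC",
                       "BODY CHASSIS CHASSIS BODY"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: split on whitespace, keep unique words, sort, re-join
dedupe_words <- function(x) {
  words <- strsplit(as.character(x), "\\s+")
  vapply(words, function(w) paste(sort(unique(w)), collapse = " "),
         character(1))
}

# Assign back to the column; factor() rebuilds the levels from the new values
df$MAINTENANCE_TYPE <- factor(dedupe_words(df$MAINTENANCE_TYPE))

nlevels(df$MAINTENANCE_TYPE)
```

Here the first two rows both become "BODY ELECTRIC", so the three original levels reduce to two. vapply is used instead of sapply only so that the return type is guaranteed to be a character vector.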
