Remove Trailing Whitespace and Consolidate Potentially Duplicated Factors in R

Question

I want to change level name (eg "Africa " ) to another already available level (eg "Africa") in categorical variable (e.g. with the same descriptor, some factors have trailing whitespace while others do not). These variables, in the Continent column, are currently stored as factors in a dataframe.

Here are the output of my gigantic dataset

I tried series ifelse but I got weird results:

data.CONTINENT$Continent_R<- ifelse (data.CONTINENT$Continent=="Africa ","Africa",
                                     ifelse (data.CONTINENT$Continent=="Asia ","Asia", 
                                       ifelse (data.CONTINENT$Continent=="Europe ","Europe",         
                                         ifelse (data.CONTINENT$Continent=="Europe ","Europe", 
                                        ifelse (data.CONTINENT$Continent=="Multi ","Multi", 
                                                ifelse (data.CONTINENT$Continent=="North America ","North America", 
                                                        ifelse (data.CONTINENT$Continent=="South America ","South America", 
                                                                data.CONTINENT$Continent))))))); table (data.CONTINENT$Continent_R)

Here is what I got based on the prior code:

Any Advice will be greatly appreciated.

It is better to provide data as a `dput`(use `dput(head(df,n))` since this is copy-pastable. Images are mostly for plots and even then code and data should be supplied. I suggest editing to remove the images. Simply copy and paste the output to your question. — NelsonGon, Sep 05 '19 at 16:17
Can you just use `data.CONTINENT$Continent_R <- trimws(data.CONTINENT$Continent_R)` — shwan, Sep 05 '19 at 16:18
@shwan wrapped in `factor`? Like in `factor(trimws(data.CONTINENT$Continent_R))`. — Rui Barradas, Sep 05 '19 at 16:32
This works `data.CONTINENT$Continent <- trimws(data.CONTINENT$Continent); table (data.CONTINENT$Continent)` Thanks a lot — Mohamed Rahouma, Sep 05 '19 at 16:37
You could also use `new_var <- DescTools::StrTrim(data.CONTINENT$Continent)` (if it's just white space you're worried about). — Cole Robertson, Sep 05 '19 at 16:53
The difference between this and the post it's marked as a duplicate of is that simply assigning back the result of `trimws` will make that vector a character vector instead of still a factor, and the question seems specific to factors. But it should be fairly trivial to adapt to factors — camille, Sep 05 '19 at 18:31

score 4 · Answer 1 · answered Sep 05 '19 at 16:22

I would use the amazing forcats package.

library(forcats)

data.CONTINENT$Continent_R <- fct_collapse(data.CONTINENT$Continent_R,
                                            Africa = c("Africa", "Africa "),
                                            `South America` = c("South America", "South America "))

Programtically if all you wanted to do was to remove the trailing whitespace, you could do something like:

# where the regex '\\s?$' = remove one or none spaces before the end of the string
data.CONTINENT$Continent_R %>% fct_relabel(~ gsub("\\s+$", "", .x))

I used your code `data.CONTINENT$Continent_R %>%forcats::fct_relabel(~ gsub("\\s+$", "", .x)) ` and it works. Thanks a lot — Mohamed Rahouma, Sep 05 '19 at 16:30

score 4 · Answer 2 · answered Sep 05 '19 at 17:28

If all you're trying to do is remove whitespace, just use the base trimws function (or stringr::str_trim, although I don't know what advantage it has, if any). Replace the levels with their trimmed versions.

You didn't include a reproducible version of data, so I'm creating it by pasting continent names with randomly sampled empty strings or single spaces.

set.seed(123)
data.CONTINENT <- data.frame(
  Continent = paste0(sample(c("Africa", "Asia", "South America"), 10, replace = T), 
                     sample(c("", " "), 10, replace = T))
)

levels(data.CONTINENT$Continent)
#> [1] "Africa"         "Asia"           "Asia "          "South America" 
#> [5] "South America "

Version one: replace the labels with their trimmed versions, and set it back to being a factor.

factor(data.CONTINENT$Continent, labels = trimws(levels(data.CONTINENT$Continent)))
#>  [1] South America South America South America Asia          South America
#>  [6] Asia          Asia          Asia          South America Africa       
#> Levels: Africa Asia South America

Version two: use forcats and just pass the name of the function you need applied to the labels. Gets same output as above.

forcats::fct_relabel(data.CONTINENT$Continent, trimws)

score 3 · Answer 3 · answered Sep 05 '19 at 16:45

There are a lot of potential approaches here. You could:

Manually replace them one at a time:

data.CONTINENT$Continent[which(data.CONTINENT$Continent=="Africa ")] <- "Africa"

Use a look-up table to replace them all at once:

lut <- data.frame(old = c('Africa ', 'South America '),
                  new = c('Africa', 'South America'))

# copy data to a new column to avoid over-writing data
data.CONTINENT$Continent_R <- data.CONTINENT$Continent
# replace only the 'old' values with the 'new' values in the look-up-table
data.CONTINENT$Continent_R[which(data.CONTINENT$Continent %in% lut$old)] <- lut$new[match(data.CONTINENT$Continent[which(data.CONTINENT$Continent %in% lut$old)], lut$old)]
# You may want to re-factor the column after this if you want to use it as a factor variable so as to remove the old factors that are no longer present.

If the only issues are extra spaces before and/or after entries, then you can just use the trimws() function.

Use the dplyr::recode() function.

data.CONTINENT$Continent_R <- dplyr::recode(data.CONTINENT$Continent, 'Africa ' = 'Africa', 'South America ' = 'South America')

And there are probably 20 other ways of doing things using functions like dplyr::join or switch.

Remove Trailing Whitespace and Consolidate Potentially Duplicated Factors in R

3 Answers3