Visualising data with categorical variables

Question

I'm using a dataset with an age group variable. I'm trying to visualise the total purchasing value of different age groups on a bar chart but there are two bars for the same age groups. I checked to see if there were repeat age groups with the unique() function and it has told me that there are two "unique" 18-24 age groups. I want this to be shown in a single bar but it shows me two. Could anyone tell me what's going on?

If you take a closer look, you will see that the two labels use different characters to separate the ages: a short hyphen in one and a longer dash in the other. To get only one bar per age group, you have to recode your age categories, e.g. replace the longer dash with the shorter one. For more help, I would suggest providing [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data. — stefan, Apr 26 '23 at 15:40

score 0 · Answer 1 · answered Apr 26 '23 at 19:28

A computer represents every character as a number. The schemes that map characters to numbers are known as encodings.

The duplicated categories are not really duplicates as they contain different dash characters. They may seem similar to human eyes and equivalent in meaning, but to the computer they are as different as 'a' and 'b' are.
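To make the difference visible, here is a minimal sketch that compares two look-alike labels and prints the code points of their separators. The labels are made up for illustration, and the second one is assumed to contain an en dash (U+2013); your data may contain a different dash character.

```r
# Hypothetical labels: one with an ASCII hyphen-minus, one with an en dash.
hyphen_label  <- "18-24"
en_dash_label <- paste0("18", "\u2013", "24")

# They look alike on screen but are not the same string.
identical(hyphen_label, en_dash_label)
# [1] FALSE

# utf8ToInt() returns the Unicode code point of a character.
utf8ToInt("-")
# [1] 45
utf8ToInt("\u2013")
# [1] 8211
```

Running `utf8ToInt()` on one of your own labels shows which dash characters your data actually contains.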

You can unify the groups by transforming your age variable with a regex that matches any dash character (the Unicode category `\p{Pd}`, dash punctuation) and replacing each match with the ordinary hyphen.

(I'm assuming that the age groups are stored in an age column of a data.frame named data.)

data <- data.frame(
  age = c(
    "18-24",
    "18–24",
    "18-24",
    "25-34"
  )
)

unique(data$age)
# [1] "18-24" "18–24" "25-34"

# Using base R (gsub() replaces every dash in a label, sub() only the first):
data$age <- gsub("\\p{Pd}", "-", data$age, perl = TRUE)

unique(data$age)
# [1] "18-24" "25-34"

# Using `stringr`:
library(stringr)

# str_replace_all() replaces every dash in a label, not only the first
data$age <- str_replace_all(data$age, "\\p{Pd}", "-")

unique(data$age)
# [1] "18-24" "25-34"
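To tie this back to the bar chart in the question, here is a rough sketch of summing purchases per age group once the dashes are unified. The `sales` data frame, the `purchase` column, and its values are all hypothetical, since the question does not show the real data.

```r
# Hypothetical data: `purchase` is an assumed column name, values are made up.
sales <- data.frame(
  age = c("18-24", paste0("18", "\u2013", "24"), "25-34"),
  purchase = c(10, 20, 5)
)

# Normalise every dash variant to the ASCII hyphen-minus.
sales$age <- gsub("\\p{Pd}", "-", sales$age, perl = TRUE)

# Sum purchases per age group; each group now appears exactly once.
totals <- aggregate(purchase ~ age, data = sales, FUN = sum)
totals
#     age purchase
# 1 18-24       30
# 2 25-34        5
```

Passing the result to `barplot(totals$purchase, names.arg = totals$age)` should then draw a single bar per age group.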
