I'm using a dataset with an age group variable. I'm trying to visualise the total purchasing value of different age groups on a bar chart but there are two bars for the same age groups. I checked to see if there were repeat age groups with the unique() function and it has told me that there are two "unique" 18-24 age groups. I want this to be shown in a single bar but it shows me two. Could anyone tell me what's going on?

stefan

Milan Chinnick
If you take a closer look, you'll see that two different symbols are used to separate the ages: a shorter hyphen and a longer dash. To get only one bar per age group, you have to recode your age categories, e.g. replace the longer dash with the shorter one. For more help, I would suggest providing [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data. – stefan Apr 26 '23 at 15:40
1 Answer
All characters are represented numerically in a computer. The systems that map characters to numbers are known as encodings.
The duplicated categories are not really duplicates as they contain different dash characters. They may seem similar to human eyes and equivalent in meaning, but to the computer they are as different as 'a' and 'b' are.
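As a small illustration of that point, the sketch below compares two labels that look alike. It assumes the second label contains an en dash (U+2013); the actual character in your data may be a different dash, and the variable names are made up for the example.

```r
# Two labels that look alike but use different dash characters.
hyphen  <- "18-24"                        # ASCII hyphen-minus, U+002D
en_dash <- paste0("18", "\u2013", "24")   # en dash, U+2013 (assumed here)

# The strings are not equal, so R treats them as separate categories.
hyphen == en_dash
# [1] FALSE

# Inspecting the code points shows where they differ (third position).
utf8ToInt(hyphen)
# [1] 49 56 45 50 52
utf8ToInt(en_dash)
# [1]   49   56 8211   50   52
```

Running `utf8ToInt()` on a suspicious label from your own data is a quick way to find out which dash character it actually contains.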
You can transform your `age` variable to unify the groups by using a regex that matches any dash character and replacing each match with the common hyphen.
(I'm assuming that the age groups are stored in an `age` column of a data.frame called `data`.)
```r
data <- data.frame(
  age = c(
    "18-24",
    "18–24",
    "18-24",
    "25-34"
  )
)

unique(data$age)
# [1] "18-24" "18–24" "25-34"

# Using base R:
data$age <- sub("\\p{Pd}", "-", data$age, perl = TRUE)
unique(data$age)
# [1] "18-24" "25-34"

# Using `stringr`:
library(stringr)
data$age <- str_replace(data$age, "\\p{Pd}", "-")
unique(data$age)
# [1] "18-24" "25-34"
```
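To connect this back to the bar chart in the question, here is a minimal sketch of normalising the labels and then summing per age group. The `purchases` data frame and its `value` column are hypothetical stand-ins, since the question does not show the real data or column names.

```r
# Hypothetical data: `value` stands in for the purchase amount column.
purchases <- data.frame(
  age   = c("18-24", paste0("18", "\u2013", "24"), "25-34"),
  value = c(10, 20, 5)
)

# Replace every dash-like character (Unicode category Pd) with "-".
purchases$age <- gsub("\\p{Pd}", "-", purchases$age, perl = TRUE)

# Sum the values per (now unified) age group.
totals <- aggregate(value ~ age, data = purchases, FUN = sum)
totals

# One bar per age group.
barplot(totals$value, names.arg = totals$age)
```

`gsub()` is used here instead of `sub()` so that a label containing more than one dash would be fully normalised; for labels like "18-24" the two behave the same.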

Santiago