I'm using backreferences to remove accidental repeats in vectors of variable names. The names in the first case I encountered have repeat patterns like this:
x <- c("gender_gender-1", "county_county-2", "country_country-1997",
"country_country-1993")
The repeats were always separated by an underscore, there was only one repeat to eliminate, and the repeated name always started at the beginning of the text. After checking the Regular Expressions Cookbook, 2nd ed., I arrived at an answer that works:
> gsub("^(.*?)_\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
I was worried that future cases might use a dash, space, or comma as the separator, so I wanted to generalize the matching a bit. I got that worked out as well:
> x <- c("gender_gender-1", "county-county-2", "country country-1997",
+ "country,country-1993")
> gsub("^(.*?)[,_\ -]\\1", "\\1", x)
[1] "gender-1" "county-2" "country-1997" "country-1993"
So far, total victory.
Now, what is the correct fix if there are three repeats in some cases? In this one, I want "county-county-county" to become just one "county". My current pattern removes only one of the repeats:
> x <- c("gender_gender-1", "county-county-county-2")
> gsub("^(.*?)[,_\ -]\\1", "\\1", x)
[1] "gender-1" "county-county-2"
I am willing to replace all of the separators by "_" if that makes it easier to get rid of the repeat words.
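One possible approach, offered as a sketch rather than a definitive answer: wrap the separator and the backreference in a group and let that group repeat with `+`, so any number of extra copies is consumed in a single match. The input vector below is a made-up example, and `perl = TRUE` is an assumption made here so that the lazy quantifier and the backreference follow PCRE rules.

```r
# Hypothetical input mixing two and three repeats and several separators.
x <- c("gender_gender-1", "county-county-county-2",
       "country country-1997", "country,country,country-1993")

# ^(.*?)         lazily capture the leading name
# ([,_ -]\\1)+   one or more further copies, each preceded by a separator
# The whole match is replaced by the single captured name.
collapsed <- gsub("^(.*?)([,_ -]\\1)+", "\\1", x, perl = TRUE)
print(collapsed)
# expected values: "gender-1" "county-2" "country-1997" "country-1993"
```

With this pattern there should be no need to convert all separators to "_" first. As with the original pattern, the backreference only checks that the repeated text follows a separator, not that it ends at a word boundary, so a name such as "a_ab" would be shortened to "ab".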