I tried to adapt this solution to group a vector by regular expressions into multiple groups, but it fails and I can't figure out what I'm doing wrong. Another solution didn't help either.
x1 <- gsub(paste0("(^a?A?pr)|(^a?A?ug)|(d?D?ec)"),
"\\1 \\2 \\3", x)
> unique(x1)
[1] " dec" "Apr " " aug " "apr " " Dec" " Aug "
I expected three unique groups, as I defined them in the gsub, i.e. just something like "dec Dec", "aug Aug", "apr Apr".
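To make the behavior easier to see, here is a minimal sketch that runs the same pattern with visible delimiters in the replacement instead of spaces. The delimiters and the short input vector are assumptions for illustration only. It suggests that, with alternation, only the group that actually matched is filled and the others stay empty, so the matched text is returned unchanged rather than mapped to a common label.

```r
# Same three-group pattern as above; "<", "|" and ">" are only there to make
# the empty groups visible in the result.
pat <- "(^a?A?pr)|(^a?A?ug)|(d?D?ec)"

# Small illustrative input (a subset of the spellings in x).
res <- gsub(pat, "<\\1|\\2|\\3>", c("dec", "Apr", "aug"))

# Each element shows its own spelling in exactly one slot, e.g. "<||dec>".
print(res)
```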
With more than 9 groups it's even worse.
y1 <- gsub(paste0("(^a?A?pr)|(^a?A?ug)|(d?D?ec)|(^f?F?eb)|(^j?J?an)|(^j?J?ul)|",
"(^j?J?un)|(^m?M?ar)|(^m?M?ay)|(^n?|N?ov)|(^o?O?ct)|(^s?S?ep)"),
"\\1 \\2 \\3 \\4 \\5 \\6 \\7 \\8 \\9 \\10 \\11 \\12", y)
> unique(y1)
[1] " 0 1 2" " jun 0 1 2"
[3] " jul 0 1 2" " Aug 0 1 2"
[5] " Jul 0 1 2" " feb 0 1 2"
[7] " Jun 0 1 2" " Mar 0 1 2"
[9] " jan 0 1 2" "Apr Apr0 Apr1 Apr2"
[11] " dec 0 1 2" " Feb 0 1 2"
[13] " Dec 0 1 2" "apr apr0 apr1 apr2"
[15] " aug 0 1 2"
As the final result I want a factor with one level per type, covering all the different spellings of that type (in this example, one level per month name, ignoring case).
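The kind of result I have in mind could be sketched as follows. This is only an illustration under assumptions: the lookup table patterns and the helper classify() are hypothetical names, and it swaps gsub for one grepl() call per group, which is a different technique from the attempts above.

```r
# Hypothetical lookup table: one pattern per desired level.
patterns <- c(apr = "^apr", aug = "^aug", dec = "^dec")

# Hypothetical helper: label each element with the name of the first pattern
# that matches it (ignoring case); elements matching nothing stay NA.
classify <- function(v, patterns) {
  out <- rep(NA_character_, length(v))
  for (nm in names(patterns)) {
    hit <- is.na(out) & grepl(patterns[[nm]], v, ignore.case = TRUE)
    out[hit] <- nm
  }
  factor(out, levels = names(patterns))
}

# Small illustrative input (a subset of the spellings in x).
x_small <- c("dec", "Apr", "aug", "apr", "Dec", "Aug")
f <- classify(x_small, patterns)
print(f)
```

Keeping the patterns in a named vector means the number of groups is not limited by how many backreferences a replacement string can address.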
Edit
My real application is not really about month names or upper/lower case; my groups are more complicated. The data are OCR-generated and therefore slightly corrupted. Here is another example that should illustrate my problem:
z1 <- gsub(paste0("(^0?O?c?i?t)|(^5?S?ep?P?)|(^D?d?8?o?e?c?o?)|(^a?A?p.)|",
"(^A?u.)|(F?f?E?e?b)|(^J?I?ul|ju1)|(J?j?u?2?n?2?)|(^N.+)|(^May)"),
"\\1 \\2 \\3 \\4 \\5 \\6 \\7 \\8 \\9 \\10", z)
> unique(z1)
[1] "Oit Oit0" " ju2 0" "0ct 0ct0" " ju1 0"
[5] " Au9 0" " Iul 0" " Sep 0" " Jul 0"
[9] " feb 0" " Jun 0" "Oct Oct0" " 8oc 0"
[13] " Eeb 0" " Nov 0" " Feb 0" " deo 0"
[17] " Apv 0" " Dec 0" " j2n 0" " 0"
[21] " apr 0" " Aug 0" " 5eP 0"
The different forms of the month names do not end up in the groups that I defined in the gsub regex. Also, group references with more than one digit, such as \\10, seem to cause problems (compare with case x).
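The two-digit issue can be reproduced in isolation. R's documentation for sub/gsub describes backreferences "\\1" to "\\9" only, so the following sketch (with a made-up ten-group pattern) shows "\\10" being read as "\\1" followed by a literal 0, which would explain the trailing 0, 1 and 2 in the outputs above.

```r
# Made-up pattern with ten capture groups, one letter each.
pat <- "(a)|(b)|(c)|(d)|(e)|(f)|(g)|(h)|(i)|(j)"

# "j" is matched by group 10, but "\\10" is read as group 1 plus a literal "0".
# Group 1 is empty here, so only the "0" remains.
res_j <- sub(pat, "[\\10]", "j")

# "a" is matched by group 1, so the same replacement yields "a" plus "0".
res_a <- sub(pat, "[\\10]", "a")

print(c(res_j, res_a))
```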
How can I write the gsub call so that each group defined in the regex is mapped to one unique value?
Data
x <- c("dec", "Apr", "dec", "aug", "dec", "dec", "Apr", "apr", "apr",
"dec", "Dec", "Aug", "Aug", "Apr", "Aug", "Apr", "aug", "Apr",
"apr", "Apr", "dec", "aug", "aug", "aug", "aug", "apr", "dec",
"Aug", "dec", "dec", "Dec", "Dec", "Apr", "Apr", "dec", "dec",
"Dec", "dec", "apr", "Apr", "Apr", "dec", "apr", "apr", "apr",
"apr", "Aug", "apr", "dec", "dec")
y <- c("Oct", "jun", "oct", "jul", "Aug", "jul", "Sep", "Jul", "feb",
"feb", "Jun", "Mar", "jan", "Apr", "jul", "oct", "Jun", "jan",
"Jun", "Oct", "Jul", "dec", "Jun", "Sep", "Feb", "Nov", "Feb",
"dec", "Apr", "Dec", "jan", "Aug", "Feb", "apr", "Sep", "Nov",
"aug", "oct", "Jun", "jul", "Apr", "Jun", "Apr", "Dec", "Jun",
"Jul", "Aug", "Aug", "Jul", "sep")
z <- c("Oit", "ju2", "0ct", "ju1", "Au9", "Iul", "Sep", "Jul", "feb",
"Jun", "Oct", "Jul", "8oc", "Jun", "Sep", "Eeb", "Nov", "Feb",
"deo", "Apv", "Dec", "j2n", "May", "Feb", "apr", "Sep", "Nov",
"Jul", "Aug", "Aug", "Jul", "5eP")