How to divide a vector into multiple groups using regex?

Question

I failed to adapt this solution to group a vector by regular expressions for multiple groups and can't figure out what I'm doing wrong. Another solution didn't help me either.

x1 <- gsub(paste0("(^a?A?pr)|(^a?A?ug)|(d?D?ec)"),
           "\\1 \\2 \\3", x)
> unique(x1)
[1] "  dec" "Apr  " " aug " "apr  " "  Dec" " Aug "

I expected three unique groups as I have defined them in the gsub, i.e. just something like "dec Dec", "aug Aug", "apr Apr".

With more than 9 groups it's even worse.

y1 <- gsub(paste0("(^a?A?pr)|(^a?A?ug)|(d?D?ec)|(^f?F?eb)|(^j?J?an)|(^j?J?ul)|", 
                  "(^j?J?un)|(^m?M?ar)|(^m?M?ay)|(^n?|N?ov)|(^o?O?ct)|(^s?S?ep)"),
           "\\1 \\2 \\3 \\4 \\5 \\6 \\7 \\8 \\9 \\10 \\11 \\12", y)
> unique(y1)
 [1] "         0 1 2"             "      jun   0 1 2"         
 [3] "     jul    0 1 2"          " Aug        0 1 2"         
 [5] "     Jul    0 1 2"          "   feb      0 1 2"         
 [7] "      Jun   0 1 2"          "       Mar  0 1 2"         
 [9] "    jan     0 1 2"          "Apr         Apr0 Apr1 Apr2"
[11] "  dec       0 1 2"          "   Feb      0 1 2"         
[13] "  Dec       0 1 2"          "apr         apr0 apr1 apr2"
[15] " aug        0 1 2"

As the final result I aim for a factorized vector with unique levels for the different appearances of the same type (i.e. in this example a group for each month name, not case-sensitive).

Edit

My application has less to do with month names and just upper/lower case, my groups are more complicated. The data are OCR-generated and therefore slightly destroyed. I try to make another example, that should illustrate my problem:

z1 <- gsub(paste0("(^0?O?c?i?t)|(^5?S?ep?P?)|(^D?d?8?o?e?c?o?)|(^a?A?p.)|",
                  "(^A?u.)|(F?f?E?e?b)|(^J?I?ul|ju1)|(J?j?u?2?n?2?)|(^N.+)|(^May)"),
           "\\1 \\2 \\3 \\4 \\5 \\6 \\7 \\8 \\9 \\10", z)
> unique(z1)
 [1] "Oit         Oit0" "       ju2  0"    "0ct         0ct0" "      ju1   0"   
 [5] "    Au9     0"    "      Iul   0"    " Sep        0"    "      Jul   0"   
 [9] "     feb    0"    "       Jun  0"    "Oct         Oct0" "  8oc       0"   
[13] "     Eeb    0"    "        Nov 0"    "     Feb    0"    "  deo       0"   
[17] "   Apv      0"    "  Dec       0"    "       j2n  0"    "         0"      
[21] "   apr      0"    "    Aug     0"    " 5eP        0"

The different forms of month names are not in those groups that I have defined in the gsub regex. Also group names with more than one digit as \\10 seem to make problems (compare to case x).

How can I do the gsub correctly so that my groups defined in the regex are recognized uniquely?

Data

x <- c("dec", "Apr", "dec", "aug", "dec", "dec", "Apr", "apr", "apr", 
"dec", "Dec", "Aug", "Aug", "Apr", "Aug", "Apr", "aug", "Apr", 
"apr", "Apr", "dec", "aug", "aug", "aug", "aug", "apr", "dec", 
"Aug", "dec", "dec", "Dec", "Dec", "Apr", "Apr", "dec", "dec", 
"Dec", "dec", "apr", "Apr", "Apr", "dec", "apr", "apr", "apr", 
"apr", "Aug", "apr", "dec", "dec")

y <- c("Oct", "jun", "oct", "jul", "Aug", "jul", "Sep", "Jul", "feb", 
"feb", "Jun", "Mar", "jan", "Apr", "jul", "oct", "Jun", "jan", 
"Jun", "Oct", "Jul", "dec", "Jun", "Sep", "Feb", "Nov", "Feb", 
"dec", "Apr", "Dec", "jan", "Aug", "Feb", "apr", "Sep", "Nov", 
"aug", "oct", "Jun", "jul", "Apr", "Jun", "Apr", "Dec", "Jun", 
"Jul", "Aug", "Aug", "Jul", "sep")

z <- c("Oit", "ju2", "0ct", "ju1", "Au9", "Iul", "Sep", "Jul", "feb", 
       "Jun", "Oct", "Jul", "8oc", "Jun", "Sep", "Eeb", "Nov", "Feb", 
       "deo", "Apv", "Dec", "j2n", "May", "Feb", "apr", "Sep", "Nov", 
       "Jul", "Aug", "Aug", "Jul", "5eP")

Something like `sp <- split(tolower(x), tolower(x)); f <- factor(match(tolower(x), names(sp)))`? — Rui Barradas, Mar 10 '19 at 17:22
@RuiBarradas Thx, but not really; it's a simplified example, my real data is more complicated. Also the order should stay the same. The [regex tester](https://regex101.com/) recognizes my groups but `gsub` won't - that's my problem. — jay.sf, Mar 10 '19 at 17:30

IRTFM · Answer 1 · 2019-03-11T18:18:10.580

With the much more complex task now on the table, I would build a set of patterns for each month. Clearly some of those entries, e.g. "ju2", are ambiguous and you problem definition is lacking in direction for that case. I suppose that "Oit" is really Oct". So I would first build a correctly spelled vector that can latter be "augmented" or modified to account for ambiguity:

(pats <- sapply(  as.data.frame(  t(matrix( c( month.abb, tolower(month.abb)), ncol=2))) , paste0,collapse="|" ) )
       V1        V2        V3        V4        V5        V6        V7        V8        V9 
"Jan|jan" "Feb|feb" "Mar|mar" "Apr|apr" "May|may" "Jun|jun" "Jul|jul" "Aug|aug" "Sep|sep" 
      V10       V11       V12 
"Oct|oct" "Nov|nov" "Dec|dec" 
pats[1] <- "Jan|jan|Ja|ja"  
 # add generality (that's actually redundant)
 # and might better "Ja.|ja.|.an"
pats[10] <- "Oct|oct|Oit|oc|Oc"
# or more compactly:  "OC|oc|O.t"

Then you can run these "generalized patterns in a loop to correct the entries:

 zcopy <- z
for( p in seq_along(pats) ) { 
         zcopy[grepl( pats[p], z)] <- month.abb[p] }
#------------------------
> zcopy
 [1] "Oct" "ju2" "0ct" "ju1" "Au9" "Iul" "Sep" "Jul" "Feb" "Jun" "Oct" "Jul" "Oct" "Jun" "Sep"
[16] "Eeb" "Nov" "Feb" "deo" "Apv" "Dec" "j2n" "May" "Feb" "Apr" "Sep" "Nov" "Jul" "Aug" "Aug"
[31] "Jul" "5eP"

You will need to decide how general to make this, i.e, do you just want to add "5ep" to the September pattern, or should it be ".ep"? But I think I have delivered a fairly compact bit of code for what is a fairly complex task.

If you want to make a character position completely wildcard than you can use a period in the pattern, eg. you ight decide to let any letter followed by "ul" to be an acceptable hit for July and then just add ".ul" to that pattern string (with a "|" pipe separator of course.

-----------old answer--

I'm not sure I understand, but if you just want a "regularized" alphabetization of month abbreviations then you can use match on the "lowercased" vector against the system vector month.abb:

month.abb[ match(tolower(x), tolower(month.abb) )]
 [1] "Dec" "Apr" "Dec" "Aug" "Dec" "Dec" "Apr" "Apr" "Apr" "Dec" "Dec"
[12] "Aug" "Aug" "Apr" "Aug" "Apr" "Aug" "Apr" "Apr" "Apr" "Dec" "Aug"
[23] "Aug" "Aug" "Aug" "Apr" "Dec" "Aug" "Dec" "Dec" "Dec" "Dec" "Apr"
[34] "Apr" "Dec" "Dec" "Dec" "Dec" "Apr" "Apr" "Apr" "Dec" "Apr" "Apr"
[45] "Apr" "Apr" "Aug" "Apr" "Dec" "Dec"

Obviously this can be made into a factor with the factor function but should probably have the levels set in the right order:

factor( month.abb[ match(tolower(x), tolower(month.abb) )], levels=month.abb)
 [1] Dec Apr Dec Aug Dec Dec Apr Apr Apr Dec Dec Aug
[13] Aug Apr Aug Apr Aug Apr Apr Apr Dec Aug Aug Aug
[25] Aug Apr Dec Aug Dec Dec Dec Dec Apr Apr Dec Dec
[37] Dec Dec Apr Apr Apr Dec Apr Apr Apr Apr Aug Apr
[49] Dec Dec
12 Levels: Jan Feb Mar Apr May Jun Jul Aug ... Dec

Ay me, obviously I have minimalized my problem way too much!! I hope my edit sheds a little more light on what I need? — jay.sf, Mar 11 '19 at 06:25
That doesn't explain how/if it works with the `gsub`, but your edit definitely pushed me in the right direction to solve my problem, thanks for that! — jay.sf, Mar 11 '19 at 21:21
To be honest I could not make heads or tails of your `gsub` proposition. If you did sove it with somethiong like that then you should post an answer. That's completely acceptable here. If it was useful you might upvote mine and after a suitable interval checkmark yours. — IRTFM, Mar 11 '19 at 21:25
I followed your advice to attempt an own answer, with `grep` though. — jay.sf, Mar 19 '19 at 19:06
@jay.sf : As questioner you have the capacity to both upvote and bestow checkmarks which results in a 150% increase in delivered rep. But then you can't upvote your answer which you might reasonably choose to do instead, — IRTFM, Mar 19 '19 at 19:12

Andrew · Answer 2 · 2019-03-11T14:46:10.300

The first solution works for your example but it probably does not solve your problem (i.e., it is not a regex solution). But it does work for x and y :), I am not sure exactly what you want from z. This basically identifies duplicates in the vector and pastes them together. Right now it only works when there are duplicates but it is adaptable to more than one duplicate (i.e., c("sep", "Sep", "seP").

# For y
y_sort <- sort(unique(y))

#Extract single factors
table <- data.frame(table(tolower(y_sort)), stringsAsFactors = FALSE)
solo <- as.character(table[which(table$Freq < 2), ]$Var1)
y_sort_dups <- y_sort[!tolower(y_sort) %in% solo]

# Create indices for dups
rep_indices <- rle(tolower(y_sort_dups))$lengths

# Paste together dups
levels <- cumsum(rep_indices) - 1
dups <- unique(paste(y_sort_dups[levels], y_sort_dups[levels + 1], sep = " "))

# Add back in solo months
sort(c(dups, y_sort[tolower(y_sort) %in% solo]))
[1] "apr Apr" "aug Aug" "dec Dec" "feb Feb" "jan"     "jul Jul" "jun Jun" "Mar"     "Nov"     "oct Oct" "seP Sep"

However, if you are working with OCR generated data, why don't you clean it up a lot before creating your factors? Below I adapted some syntax I used for a similar project, it isn't perfect but you'll get the general idea. Instead of jarowinkler you could use levenshteinDist with min and which.min or some other distance measure. Hope this helps and best of luck!

# Cleaning up z
library(RecordLinkage)

# Vector with all values 
z_lower <- trimws(tolower(z))

# Vector with legitimate values (can add to, this was the quick way)
z_dups <- unique(c(unique(z_lower[duplicated(z_lower)]), tolower(month.abb)))

# Create df to viewing
df <- data.frame(z_lower = z_lower, stringsAsFactors = FALSE)

# Swap out numbers that look like letters
df$z_gsub <- gsub("0", "o", df$z_lower, fixed = TRUE)
df$z_gsub <- gsub("3", "e", df$z_gsub, fixed = TRUE)
df$z_gsub <- gsub("4", "a", df$z_gsub, fixed = TRUE)
df$z_gsub <- gsub("5", "s", df$z_gsub, fixed = TRUE)
df$z_gsub <- gsub("6", "g", df$z_gsub, fixed = TRUE)
df$z_gsub <- gsub("8", "b", df$z_gsub, fixed = TRUE)


df$distance <- sapply(df$z_gsub, function(x) max(jarowinkler(x, z_dups)))
df$match <- sapply(df$z_gsub, function(x) z_dups[which.max(jarowinkler(x, z_dups))])

> unique(df[order(df$distance), ])
   z_lower z_gsub  distance match
13     8oc    boc 0.5555556   nov
6      iul    iul 0.7777778   jul
16     eeb    eeb 0.7777778   feb
1      oit    oit 0.8000000   oct
22     j2n    j2n 0.8000000   jun
2      ju2    ju2 0.8222222   jul
4      ju1    ju1 0.8222222   jul
5      au9    au9 0.8222222   aug
19     deo    deo 0.8222222   dec
20     apv    apv 0.8222222   apr
3      0ct    oct 1.0000000   oct
7      sep    sep 1.0000000   sep
8      jul    jul 1.0000000   jul
9      feb    feb 1.0000000   feb
10     jun    jun 1.0000000   jun
11     oct    oct 1.0000000   oct
17     nov    nov 1.0000000   nov
21     dec    dec 1.0000000   dec
23     may    may 1.0000000   may
25     apr    apr 1.0000000   apr
29     aug    aug 1.0000000   aug
32     5ep    sep 1.0000000   sep

score 1 · Accepted Answer · answered Mar 19 '19 at 19:02

I could not (yet) get a gsub solution, but grep can give me what I want. I make another example with color names, since the month names were somewhat misleading.

Consider a column which values are known but due to OCR somewhat disrupted.

> dat$z1
 [1] "grcen"  "grey"   "b1ue"   "gree2"  "grey"   "bei9e"  "grey"   "beige" 
 [9] "b|ue"   "bcige"  "green"  "grey"   "giieen" "blue"   "belge"  "bliie"

First I create a vector rex with a regex for each case.

rex <- c("(^bl?1?\\|?u?i*?e$)", "(^be?c?i?l?g?9?.$)", "(^gr?i*c?e*n?2?$)", 
         "(^grey$)")

Then, I use grep to get a matrix M of the related positions

M <- sapply(rex, function(i) grep(i, dat$z1))

and unify them in a for loop by assigning category numbers:

for (j in seq(rex)) dat$z1[M[, j]] <- j

Finally I factorize the column and assign the correct labels for each category.

factor(dat$z1, labels=c("blue", "beige", "green", "gray"))
# [1] green gray  blue  green gray  beige gray  beige blue  beige green gray 
# [13] green blue  beige blue 
# Levels: blue beige green gray

Data

dat <- structure(list(z1 = c("grcen", "grey", "b1ue", "gree2", "grey", 
"bei9e", "grey", "beige", "b|ue", "bcige", "green", "grey", "giieen", 
"blue", "belge", "bliie")), class = "data.frame", row.names = c(NA, 
-16L))

How to divide a vector into multiple groups using regex?

3 Answers3