
I have a list of strings; an example is shown below (the actual list has much more variety in format):

[1] "AB-123"
[2] "AB-312"
[3] "AB-546"
[4] "ZXC/123456"

Assuming [1] is the correct format, I want to extract the regular expression from [1] and match it against the rest to detect that [4] is inconsistent. Is there a method to do this or is there a better way to achieve the same outcome?

EDIT: I found something close to what I require. Does anyone know of any packages that do this? "Given a string, generate a regex that can parse *similar* strings"

Rui
  • "*I want to extract the regular expression from [1]*" - do you have any thoughts about how to do that? How are you defining 'consistent'? Same length? Same rough pattern either side of a `-`? Numbers vs letters comparison? – thelatemail Jan 25 '17 at 02:47
  • Do you need something like `grepl(substr(v1[1], 1, 2), v1[-1])` where `v1 <- c( "AB-123" , "AB-312" , "AB-546" , "ZXC/123456")` – akrun Jan 25 '17 at 02:54
  • @akrun yes, I need something like that, except that the string might not always start with "AB", which is why I wanted to extract the regular expression from the string instead of specifying it – Rui Jan 25 '17 at 03:20
  • @thelatemail I thought "consistent" would mean the same number of characters and the same positions of letters, numbers, and signs. I thought of breaking the string down into a list of types, e.g. "AB-123" = [char, char, sign, num, num, num]. Not too sure if that would work – Rui Jan 25 '17 at 03:21

2 Answers


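We may need `grepl`: take the part of the first element before the `-` and check whether each of the remaining elements contains it.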

 grepl(sub("-.*", "", v1[1]), v1[-1])

data

v1 <- c( "AB-123" , "AB-312" ,  "AB-546" , "ZXC/123456")
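
The one-liner above assumes the reference string contains a `-`. As a rough sketch of a variant that does not rely on the separator, the leading run of letters could be taken as the prefix instead. The names `prefix` and `ok` are only illustrative, and treating "leading letters" as the prefix is an assumption about the data, not something stated in the question:

```r
v1 <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")

# Leading run of letters in the reference string (assumed to be the "prefix").
# If the string does not start with a letter, this is character(0).
prefix <- regmatches(v1[1], regexpr("^[A-Za-z]+", v1[1]))

# startsWith() compares literally, so the prefix needs no regex escaping.
ok <- startsWith(v1, prefix)
ok
# [1]  TRUE  TRUE  TRUE FALSE
```

This only checks the prefix, so it would not notice a wrong number of digits after it.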
akrun

Here's an attempt at a function that classifies each character of every string as a Character, Digit or Other, and then compares the resulting pattern with that of the first element. It is a bit rough, but I'm sure it can be expanded to match exactly what you want:

test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")

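compare_1st <- function(x) {
  x <- toupper(x)
  # Placeholders for the three classes: letter, digit, anything else
  chars <- c("A", "1", "-")
  repl  <- c("[A-Z]", "[0-9]", "[^0-9A-Z]")
  # Collapse every character to its class placeholder
  for (i in seq_along(repl)) x <- gsub(repl[i], chars[i], x)
  # TRUE where the pattern equals that of the first element
  out <- x[1] == x
  # Keep the patterns, relabelled C(haracter), D(igit), O(ther)
  attr(out, "values") <- chartr("A1-", "CDO", x)
  out
}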

compare_1st(test)
#[1]  TRUE  TRUE  TRUE FALSE
#attr(,"values")
#[1] "CCODDD"     "CCODDD"     "CCODDD"     "CCCODDDDDD"
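
The same character-class idea could also be turned into an actual regular expression, which is closer to what the question asks for. The following is a minimal sketch, not a tested package: `regex_from_example` is a hypothetical helper name, and it assumes that letters and digits should match by class and count while every other character must match literally.

```r
# Hypothetical helper: derive an anchored regex from one example string by
# run-length encoding its character classes (letter, digit, literal other).
regex_from_example <- function(example) {
  chars <- strsplit(example, "")[[1]]
  # Letters and digits become classes; anything else is quoted with \Q...\E
  # so it is matched literally (this needs perl = TRUE when matching).
  cls <- ifelse(grepl("[A-Za-z]", chars), "[A-Za-z]",
         ifelse(grepl("[0-9]", chars), "[0-9]",
                paste0("\\Q", chars, "\\E")))
  runs <- rle(cls)
  # Add a {n} quantifier to runs longer than one character
  parts <- ifelse(runs$lengths > 1,
                  paste0(runs$values, "{", runs$lengths, "}"),
                  runs$values)
  paste0("^", paste(parts, collapse = ""), "$")
}

test <- c("AB-123", "AB-312", "AB-546", "ZXC/123456")
pattern <- regex_from_example(test[1])
matches <- grepl(pattern, test, perl = TRUE)
matches
# [1]  TRUE  TRUE  TRUE FALSE
```

Whether the sign should be matched literally or only as "some non-alphanumeric character" depends on how strict "consistent" is meant to be; the sketch takes the stricter reading.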
thelatemail