I have a column of residential addresses in my dataset 'ad'. I want to check for addresses that contain no numbers (including Roman numerals). I'm using

ad$check <- grepl("[[:digit:]]",ad$address)

to flag addresses with no digits present (check is FALSE for those). How do I do the same for addresses that contain Roman numerals?

E.g.: "floor X, DLF Building- III, ABC City"

pogibas
Priya T
  • Is the structure within your address column homogeneous? E.g. is it always _Floor #, Building #, City_? And are the numbers followed by commas? – Val Mar 07 '18 at 09:11
  • https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression?rq=1 could be helpful – De Novo Mar 07 '18 at 09:14
  • No, it is not homogeneous. Numbers may or may not be followed by commas. It can be "III DLF XI floor" as well. – Priya T Mar 07 '18 at 09:19

1 Answer

You need to make a regex string.

Edit (my first answer was nonsense):

x <- c("floor Imaginary,  building- Momentum, ABC City", "floor X, DLF Building- III, ABC City")
# here comes the regex
grepl("\\b[IVXLCDM]+\\b", x, ignore.case = FALSE)
[1] FALSE  TRUE

To break it down:

\\b are word boundaries. A Roman numeral must be preceded and followed by whitespace, punctuation, or the beginning/end of the string.

[IVXLCDM]+ means the "word" we are looking for can consist only of the symbols used for Roman numerals, one or more of them. These should be all as far as I know. No | is needed inside square brackets; there it would match a literal pipe character.

ignore.case = FALSE is the default, so omitting the option gives the same result. I find it safer, however, to state it explicitly when case matters for the operation at hand.

Use with caution: a company called, e.g., "LCD Industries" would also be flagged as containing a Roman numeral. You could combine my approach with this answer to further test whether the symbols are in the right order.
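
As a rough sketch of what such an order check could look like: the pattern below is the commonly used "thousands, hundreds, tens, units" regex for Roman numerals up to 3999, applied to each word of an address separately. The helper name has_roman() and the third test address are made up for illustration; this is one possible approach, not a tested solution for your data.

```r
# Strict pattern for one whole token: thousands, hundreds, tens, units.
roman_strict <- "^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

# has_roman() is a hypothetical helper name, not part of the answer above.
has_roman <- function(address) {
  # Split each address into tokens on anything that is not a letter or digit.
  tokens <- strsplit(address, "[^[:alnum:]]+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]  # the strict pattern also matches "", so drop empties
    any(grepl(roman_strict, tok))
  }, logical(1))
}

x <- c("floor Imaginary,  building- Momentum, ABC City",
       "floor X, DLF Building- III, ABC City",
       "LCD Industries, Main Road")  # third address is an invented test case
has_roman(x)
```

Here "LCD" is rejected because L, C, D is not a valid numeral order. Single letters such as "C" or "D" in "C Block" are valid numerals (100 and 500) and would still be flagged, so some manual checking is probably unavoidable.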

Please test on your data and report if it works.
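
To connect this back to the question, here is a minimal sketch of how the digit check and a loose Roman-numeral check could be combined into one flag column. The data frame below is a made-up stand-in for 'ad', and the column name no_number is an assumption, so adapt both to your data.

```r
# Toy stand-in for the asker's data frame 'ad'; these rows are invented.
ad <- data.frame(
  address = c("floor X, DLF Building- III, ABC City",
              "12 Park Street, ABC City",
              "Rose Villa, Lake Road, ABC City"),
  stringsAsFactors = FALSE
)

# TRUE where the address contains at least one digit.
has_digit <- grepl("[[:digit:]]", ad$address)
# TRUE where a whole word consists only of Roman-numeral letters (loose check).
roman_loose <- grepl("\\b[IVXLCDM]+\\b", ad$address)

# TRUE when the address has neither digits nor a Roman-numeral-like word.
ad$no_number <- !(has_digit | roman_loose)
ad$address[ad$no_number]
```

Only the third address is returned here, since the first contains "X" and "III" and the second contains "12".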

JBGruber
  • As you say, it would depend on the data, but white space might be the solution here. The data is much more likely to contain a "V" that is not a Roman numeral than a " V " that is not a Roman numeral. – De Novo Mar 07 '18 at 09:50
  • The above check holds true for all addresses (whether they have roman numerals or not). If an address contains roman numerals, it will have preceding and trailing white space. – Priya T Mar 07 '18 at 10:07
  • First answer was nonsense. The answer I linked was concerned with the order of the letters, not with whether they appear in a string. New answer should work and was tested on the example you gave and one I added without Roman numerals. – JBGruber Mar 07 '18 at 10:32