0

Can anyone simplify my regex? I designed it after many tests and after trying many things. Please don't simplify it according to JS rules; they seem to work differently, otherwise I would have done that myself.

"^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}[C]{0,3}([X]{1}[C]{1}){0,1}[L]{0,3}([X]{1}[L]{1}){0,1}[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$"

All characters with sequence are compulsory.

Adding some rules. This is for a Roman numeral system as per my requirements...

Numbers are formed by combining symbols together and adding the values. For example, MMVI is 1000 + 1000 + 5 + 1 = 2006. Generally, symbols are placed in order of value, starting with the largest values. When smaller values precede larger values, the smaller values are subtracted from the larger values, and the result is added to the total. For example MCMXLIV = 1000 + (1000 − 100) + (50 − 10) + (5 − 1) = 1944.
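The add-and-subtract rule above can be sketched in a few lines. This is only an illustration in Python (not the asker's C# code), assuming the standard symbol values; the name `roman_to_int` is made up for this example.

```python
# Standard symbol values (assumed; the text above only spells out some of them).
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Evaluate a numeral using the rule quoted above (no validation is done)."""
    total = 0
    for i, symbol in enumerate(numeral):
        value = VALUES[symbol]
        # A smaller value that precedes a larger one is subtracted.
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

print(roman_to_int("MMVI"))     # 2006
print(roman_to_int("MCMXLIV"))  # 1944
```

Note that this only evaluates a numeral; it does not check whether the numeral is well formed, which is what the regex is for.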

The symbols "I", "X", "C", and "M" can be repeated three times in succession, but no more. (They may appear four times if the third and fourth are separated by a smaller value, such as XXXIX.) "D", "L", and "V" can never be repeated. "I" can be subtracted from "V" and "X" only. "X" can be subtracted from "L" and "C" only. "C" can be subtracted from "D" and "M" only. "V", "L", and "D" can never be subtracted.
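Taken literally, the rules in this paragraph can be sketched as one alternation per decimal place. The pattern below is a commonly used stricter form, not the regex from this question, and it is written in Python purely for illustration; `STRICT_ROMAN` and `is_strict_roman` are hypothetical names.

```python
import re

# Hypothetical stricter pattern (NOT the regex from the question): each decimal
# place is either a subtractive pair or an optional "five" symbol followed by
# up to three "one" symbols, so D, L and V can never repeat.
STRICT_ROMAN = re.compile(
    r"M{0,3}"            # thousands: M may repeat up to three times
    r"(CM|CD|D?C{0,3})"  # hundreds: C is subtracted from M or D only
    r"(XC|XL|L?X{0,3})"  # tens: X is subtracted from C or L only
    r"(IX|IV|V?I{0,3})"  # ones: I is subtracted from X or V only
)

def is_strict_roman(text: str) -> bool:
    # Every part of the pattern is optional, so reject the empty string explicitly.
    return bool(text) and STRICT_ROMAN.fullmatch(text) is not None

print(is_strict_roman("XXXIX"))  # True
print(is_strict_roman("VV"))     # False
```

If the requirements really do allow repeated D, L, or V, this sketch is too strict and does not apply.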

Only one small-value symbol may be subtracted from any large-value symbol. A number written in Arabic numerals can be broken into digits. For example, 1903 is composed of 1, 9, 0, and 3. To write the Roman numeral, each of the non-zero digits should be treated separately. In the above example, 1,000 = M, 900 = CM, and 3 = III. Therefore, 1903 = MCMIII.
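The digit-by-digit procedure can be sketched with one lookup table per decimal place. This is an illustrative Python sketch, assuming numbers from 1 to 3999 (at most three M's); the table and function names are made up for the example.

```python
# Roman spellings of the digits 0-9 in the ones, tens and hundreds places.
ONES = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]

def int_to_roman(number: int) -> str:
    """Convert each decimal digit separately and concatenate the results."""
    if not 1 <= number <= 3999:
        raise ValueError("this sketch only handles 1..3999")
    thousands, rest = divmod(number, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    # A zero digit maps to the empty string, so it contributes nothing.
    return "M" * thousands + HUNDREDS[hundreds] + TENS[tens] + ONES[ones]

print(int_to_roman(1903))  # MCMIII
print(int_to_roman(1944))  # MCMXLIV
```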

  • 6
    Remove all `{1}`, turn `[C]` to `C`, `{0,1}` to `?`. – Wiktor Stribiżew Sep 06 '16 at 16:10
  • 1
    With long regular expressions like this I find it helpful to break it down into "groups" using string constants and then assemble it. As written it's difficult to tell what you're trying to achieve here. Can you give some examples of valid and invalid strings? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 16:11
  • 1
    Why are you using character classes of only a single character? How is that different from just using the single character? – adv12 Sep 06 '16 at 16:12
  • 1
    I concur with @adv12 here. For example, you can simplify ([C]{1}[M]{1}){0,1} to simply (CM)? - no need for character classes, get rid of {1}, and simplify {0,1} to ? - because you're looking for the literal string "CM". – EJoshuaS - Stand with Ukraine Sep 06 '16 at 16:14
  • 1
    Roman numerals? Related: http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression – Sebastian Proske Sep 06 '16 at 16:20
  • `"^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"` – Happy Green Kid Naps Sep 06 '16 at 16:21
  • What do you mean by "All characters with sequence are compulsory"? Right now *none* of the characters in your regular expression are compulsory - in fact, it'll match the empty string. – EJoshuaS - Stand with Ukraine Sep 06 '16 at 16:32

2 Answers

3

A few points:

  • No need for character classes with only one item, so "[M]" can be replaced with "M" (for example)
  • "{0,1}" can always be replaced with "?" without changing the meaning of the regex
  • You never need to include "{1}" as it doesn't add any additional constraints
  • For long regular expressions I suggest breaking the regex down into logical "subgroups" using string constants and building the regex from them - it's easier to read
  • Always include comments above the regular expression explaining its purpose and giving examples of valid and invalid inputs (unless it's short enough to be obvious), otherwise it'll be difficult to maintain
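As a rough sketch of the last two points, here is one way the pattern could be assembled from named string constants with a comment on top. It is shown in Python only for illustration, the constant names are made up, and the same idea should carry over to C# string constants.

```python
import re

# Purpose: validate a Roman-numeral-like string where each symbol may repeat
# up to three times and each subtractive pair (CM, CD, XC, XL, IX, IV) is optional.
# Matches:        "MDCLXVI", "MMMCM"
# Does not match: "MMMMDCLXVI"

# One fragment per symbol, largest value first (names are illustrative).
THOUSANDS = "M{0,3}(CM)?"
FIVE_HUNDREDS = "D{0,3}(CD)?"
HUNDREDS = "C{0,3}(XC)?"
FIFTIES = "L{0,3}(XL)?"
TENS = "X{0,3}(IX)?"
FIVES = "V{0,3}(IV)?"
ONES = "I{0,3}"

ROMAN_PATTERN = (
    "^" + THOUSANDS + FIVE_HUNDREDS + HUNDREDS + FIFTIES + TENS + FIVES + ONES + "$"
)

print(ROMAN_PATTERN)
print(re.search(ROMAN_PATTERN, "MMMCM") is not None)
```

The assembled string is character-for-character the same as the single-line pattern, so this changes readability only, not behavior.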

I haven't tested this as thoroughly as I'd like (it would be easier to do so given some examples of valid and invalid strings) but here's a stab at it:

"^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"

This'll match the string "MDCLXVI" but not something like "MMMMDCLXVI".
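A quick way to gain some confidence in the simplification is to run both patterns over the same inputs and compare. The sketch below does that in Python, assuming these quantifiers and groups behave the same way as in .NET for this pattern; the sample list and the `matches` helper are made up for the example.

```python
import re

# The pattern exactly as posted in the question, split across lines for width.
ORIGINAL = (
    "^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}"
    "[C]{0,3}([X]{1}[C]{1}){0,1}[L]{0,3}([X]{1}[L]{1}){0,1}"
    "[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$"
)
SIMPLIFIED = "^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"

def matches(pattern: str, text: str) -> bool:
    return re.search(pattern, text) is not None

# Hypothetical sample inputs; extend with real valid and invalid strings.
samples = ["MDCLXVI", "MMMMDCLXVI", "CM", "MMMCM", "IIII", ""]
for text in samples:
    # The two columns should agree for every sample.
    print(repr(text), matches(ORIGINAL, text), matches(SIMPLIFIED, text))
```

Agreement on a handful of samples is not a proof of equivalence, but a disagreement would immediately point at a mistake in the rewrite.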

With that said, I suspect that your original regex isn't doing exactly what you intended it to, so this may not be only a problem of simplification. For example, you state in your post that "All characters with sequence are compulsory", but right now no particular sequence of strings is required; in fact, the regex will even match the empty string, which I suspect isn't what you want.
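To illustrate the empty-string point, here is a small Python sketch. The `NON_EMPTY` variant, which adds a `(?=.)` lookahead to require at least one character, is only one possible fix and is my assumption about what might be wanted, not something stated in the question.

```python
import re

SIMPLIFIED = "^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"

# Assumed fix: a lookahead right after ^ that demands at least one character.
NON_EMPTY = "^(?=.)" + SIMPLIFIED[1:]

def matches(pattern: str, text: str) -> bool:
    return re.search(pattern, text) is not None

print(matches(SIMPLIFIED, ""))    # True: every part of the pattern is optional
print(matches(NON_EMPTY, ""))     # False
print(matches(NON_EMPTY, "CM"))   # True
```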

  • When I try (CM), it does not validate it as "CM"; it validates "C" and "M" both... I already tried this approach – Vijay Vasudevbhai Gurunanee Sep 06 '16 at 16:26
  • If you want C, M, or both you can do C?M? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 16:29
  • @VijayGurunanee "C" alone and "M" alone *are*, in fact, valid inputs to this regex as written. If you try "(CM)?" itself alone in Regex Hero or another regex tester, however, it will not match simply "C" or "M". Can you clarify what the problem is here? What's an example of a string that this regex accepts that you think it should be rejecting? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 17:28
  • @VijayGurunanee Note that you have M{0,3} and C{0,3}, so "M" and "C" are both perfectly valid strings. If you give one of these strings to the regex I posted it'll match one of those two constraints, *not* (CM)? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 17:30
  • The largest acceptable string is "MMMCMDDDCDCCCXCLLLXLXXXIXVVVIVIII". All others are sub-strings of this with the same sequence. Now, one rule is that M can come four times: three continuous occurrences, and one M that must come after C (if we want). So MMMCM can be valid but MMMC is not valid. When we say (CM)?, it checks for only C or only M. I tried it just before posting this question: romanRegEx.IsMatch("CM") gives me False in my C# program. :( – Vijay Vasudevbhai Gurunanee Sep 06 '16 at 18:01
0

This expression cannot be simplified for now because I am trying to validate the string with C# regex processing. I have tried many other ways, including the suggestions provided above.

So closing this question for now.