Simplify my regular expression (it's in C# so many suggestions are not working, I already tried)

Question

Can anyone simplify my regex? I designed it after many tests and after trying many things. Please don't simplify it according to JS rules; they seem to work differently. Otherwise I would have done that myself.

"^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}[C]{0,3}([X]{1}[C]{1}){0,1}[L]{0,3}([X]{1}[L]{1}){0,1}[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$"

All characters with sequence are compulsory.

Adding some rules. This is for a Roman numeral system, as per my requirements...

Numbers are formed by combining symbols together and adding the values. For example, MMVI is 1000 + 1000 + 5 + 1 = 2006. Generally, symbols are placed in order of value, starting with the largest values. When smaller values precede larger values, the smaller values are subtracted from the larger values, and the result is added to the total. For example MCMXLIV = 1000 + (1000 − 100) + (50 − 10) + (5 − 1) = 1944.
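As a rough illustration of the add-and-subtract rule above, here is a minimal C# sketch. `RomanToInt` is a hypothetical helper name (not part of any library), and it assumes the input is already a well-formed numeral, so it does no validation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: walks the symbols left to right and subtracts a symbol
// when a larger one follows it; otherwise adds it. No validation is done.
static int RomanToInt(string numeral)
{
    var values = new Dictionary<char, int>
    {
        ['I'] = 1, ['V'] = 5, ['X'] = 10, ['L'] = 50,
        ['C'] = 100, ['D'] = 500, ['M'] = 1000
    };

    int total = 0;
    for (int i = 0; i < numeral.Length; i++)
    {
        int current = values[numeral[i]];
        bool smallerBeforeLarger =
            i + 1 < numeral.Length && current < values[numeral[i + 1]];
        total += smallerBeforeLarger ? -current : current;
    }
    return total;
}

Console.WriteLine(RomanToInt("MMVI"));    // 2006
Console.WriteLine(RomanToInt("MCMXLIV")); // 1944
```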

The symbols "I", "X", "C", and "M" can be repeated three times in succession, but no more. (They may appear four times if the third and fourth are separated by a smaller value, such as XXXIX.) "D", "L", and "V" can never be repeated. "I" can be subtracted from "V" and "X" only. "X" can be subtracted from "L" and "C" only. "C" can be subtracted from "D" and "M" only. "V", "L", and "D" can never be subtracted.

Only one small-value symbol may be subtracted from any large-value symbol. A number written in Arabic numerals can be broken into digits. For example, 1903 is composed of 1, 9, 0, and 3. To write the Roman numeral, each of the non-zero digits should be treated separately. In the above example, 1,000 = M, 900 = CM, and 3 = III. Therefore, 1903 = MCMIII.
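The digit-by-digit rule can be sketched in C# roughly as follows. `ToRoman` is a hypothetical helper name, and the sketch assumes the number is between 1 and 3999.

```csharp
using System;

// Hypothetical helper: looks up each decimal digit separately and concatenates
// the pieces, largest place first. Assumes 1 <= number <= 3999.
static string ToRoman(int number)
{
    string[] thousands = { "", "M", "MM", "MMM" };
    string[] hundreds  = { "", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM" };
    string[] tens      = { "", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC" };
    string[] ones      = { "", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" };

    return thousands[number / 1000]
         + hundreds[number / 100 % 10]
         + tens[number / 10 % 10]
         + ones[number % 10];
}

Console.WriteLine(ToRoman(1903)); // MCMIII
Console.WriteLine(ToRoman(1944)); // MCMXLIV
```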

With long regular expressions like this I find it helpful to break them down into "groups" using string constants and then assemble them. As written it's difficult to tell what you're trying to achieve here. Can you give some examples of valid and invalid strings? – EJoshuaS - Stand with Ukraine, Sep 06 '16 at 16:11
Why are you using character classes of only a single character? How is that different from just using the single character? – adv12, Sep 06 '16 at 16:12
I concur with @adv12 here. For example, you can simplify ([C]{1}[M]{1}){0,1} to simply (CM)?, because you're looking for the literal string "CM": no need for character classes, get rid of {1}, and simplify {0,1} to ?. – EJoshuaS - Stand with Ukraine, Sep 06 '16 at 16:14
Roman numerals? Related: http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression – Sebastian Proske, Sep 06 '16 at 16:20
`"^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"` – Happy Green Kid Naps, Sep 06 '16 at 16:21
What do you mean by "All characters with sequence are compulsory"? Right now *none* of the characters in your regular expression are compulsory; in fact, it'll match the empty string. – EJoshuaS - Stand with Ukraine, Sep 06 '16 at 16:32

EJoshuaS - Stand with Ukraine · Answer 1 · 2016-09-06T17:23:15.747

3

A few points:

There's no need for character classes with only one item, so "[M]" can be replaced with "M" (for example).
"{0,1}" can always be replaced with "?" without changing the meaning of the regex.
You never need to include "{1}", as it doesn't add any additional constraints.
For long regular expressions I suggest breaking the regex down into logical "subgroups" using string constants and building the regex from them; it's easier to read.
Always include comments above the regular expression explaining its purpose and giving examples of valid and invalid inputs (unless it's short enough to be obvious); otherwise it'll be difficult to maintain.
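To illustrate the last two points, here is a minimal C# sketch that assembles the pattern from string constants and compares it against the original pattern on a few inputs. The constant names and the sample inputs are my own assumptions, not something given in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Purpose: match strings built from the question's symbol groups, in order.
// Every group is optional, so "CM", "MDCLXVI" and "" all match;
// "MMMMDCLXVI" (four M in a row) and "IM" (nothing may follow I) do not.
const string MGroup = "M{0,3}(CM)?";
const string DGroup = "D{0,3}(CD)?";
const string CGroup = "C{0,3}(XC)?";
const string LGroup = "L{0,3}(XL)?";
const string XGroup = "X{0,3}(IX)?";
const string VGroup = "V{0,3}(IV)?";
const string IGroup = "I{0,3}";
const string Pattern =
    "^" + MGroup + DGroup + CGroup + LGroup + XGroup + VGroup + IGroup + "$";

var simplified = new Regex(Pattern);
var original = new Regex(
    "^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}" +
    "[C]{0,3}([X]{1}[C]{1}){0,1}[L]{0,3}([X]{1}[L]{1}){0,1}" +
    "[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$");

// Sample inputs chosen for illustration only.
string[] samples = { "", "CM", "MDCLXVI", "MMMCM", "MMMMDCLXVI", "IM" };
foreach (var s in samples)
{
    Console.WriteLine(
        $"{s,-12} original={original.IsMatch(s)} simplified={simplified.IsMatch(s)}");
}
```

If the two columns ever disagree for some input, the simplification changed the meaning and should be revisited.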

I haven't tested this as thoroughly as I'd like (it would be easier to do so given some examples of valid and invalid strings) but here's a stab at it:

"^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"

This'll match the string "MDCLXVI" but not something like "MMMMDCLXVI".

With that said, I suspect that your original regex isn't doing exactly what you intended it to, so this may not be only a problem of simplification. For example, you state in your post that "All characters with sequence are compulsory", but right now no particular sequence of characters is required; in fact, the regex will even match the empty string, which I suspect isn't what you want.
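A quick way to see the empty-string behaviour is sketched below. The `(?=.)` lookahead variant is only an assumption about what might be wanted (rejecting empty input); nothing in the question confirms that requirement.

```csharp
using System;
using System.Text.RegularExpressions;

const string Body =
    "M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}";

// The regex as posted: every part is optional, so "" matches.
var asPosted = new Regex("^" + Body + "$");

// Assumption: empty input should be rejected. The lookahead (?=.) requires
// at least one character without consuming it.
var nonEmpty = new Regex("^(?=.)" + Body + "$");

foreach (var input in new[] { "", "C", "M", "CM", "MDCLXVI", "MMMMDCLXVI" })
{
    Console.WriteLine(
        $"\"{input}\": asPosted={asPosted.IsMatch(input)} nonEmpty={nonEmpty.IsMatch(input)}");
}
```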

edited Sep 06 '16 at 17:23

answered Sep 06 '16 at 16:20

EJoshuaS - Stand with Ukraine


When I try (CM), it does not validate it as "CM"; it validates "C" and "M" both... I already tried this approach. – Vijay Vasudevbhai Gurunanee Sep 06 '16 at 16:26
If you want C, M, or both you can do C?M? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 16:29
@VijayGurunanee "C" alone and "M" alone *are*, in fact, valid inputs to this regex as written. If you try "(CM)?" itself alone in Regex Hero or another regex tester, however, it will not match simply "C" or "M". Can you clarify what the problem is here? What's an example of a string that this regex accepts that you think it should be rejecting? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 17:28
@VijayGurunanee Note that you have M{0,3} and C{0,3}, so "M" and "C" are both perfectly valid strings. If you give one of these strings to the regex I posted it'll match one of those two constraints, *not* (CM)? – EJoshuaS - Stand with Ukraine Sep 06 '16 at 17:30
The largest acceptable string is "MMMCMDDDCDCCCXCLLLXLXXXIXVVVIVIII". All others are substrings of this with the same sequence. Now, one rule is that M can occur four times: three consecutive occurrences, plus one M that must come after C (if we want it). So MMMCM can be valid but MMMC is not valid. When we say (CM)? it is checking for only C or only M. I tried it just before posting this question: romanRegEx.IsMatch("CM") is giving me False in my C# program. :( – Vijay Vasudevbhai Gurunanee Sep 06 '16 at 18:01

score 0 · Accepted Answer · answered Sep 07 '16 at 18:07


This expression cannot be simplified for now because I am trying to validate the string with C# regex processing. I have tried many other ways too, including the suggestions provided above.

So I am closing this question for now.


Vijay Vasudevbhai Gurunanee
