
I'm trying to create a regex pattern (or several). For example, given SomeCamelStringToCombine, it should match the following substrings:

Some, Camel, String, To, Combine, SomeCamel, SomeCamelString, SomeCamelStringTo, SomeCamelStringToCombine, CamelString, CamelStringTo, CamelStringToCombine, StringTo, StringToCombine, ToCombine.

I managed to create this pattern: /(?=([\p{Lu}]+[\p{L}]+))/, but it matches only

SomeCamelStringToCombine, CamelStringToCombine, StringToCombine, ToCombine, Combine.

I don't know whether I should modify it or create additional patterns, and I don't know how to do either. I'm using Java for the matching.

Can I ask you for help or tips?

Alexey
  • Duplicate of https://stackoverflow.com/questions/1128305/regular-expression-to-identify-camelcased-words-with-leading-uppercase-letter – Arpit Jul 26 '17 at 15:59
  • @Arpit: I don't think so; read the question carefully. – T.J. Crowder Jul 26 '17 at 16:00
  • I'll go out on a limb here and say you can't do it with *just* a regex. But a regex identifying the pieces combined with a loop to (re)create the combinations should be straightforward enough. – T.J. Crowder Jul 26 '17 at 16:00
  • @T.J.Crowder Apologies for the incorrect comment. And I second you on regex not being the way to do that. Maybe split the string and match against an array of Strings. – Arpit Jul 26 '17 at 16:04
  • This would be very expensive to do in regex, and I guarantee you whatever regex you come up with will be wrong in some way. It would be much better to just create an index list of the capital letters in the string, do a nested loop over the list, and take all the valid substrings from that. Much easier to do, and much less room for mistakes. – Tezra Jul 26 '17 at 16:12
  • I'm confused. What are the inputs? Do you seed it with the string `SomeCamelStringToCombine` and want to *build* a pattern that can only match those specific substrings, given that sample seed? – Andreas Jul 26 '17 at 16:21
  • To rephrase, I read your question as "E.g. having `SomeCamelStringToCombine` it should match e.g. `CamelString`, but not e.g. `SomeString`". Is that a correct interpretation of your question? Or are you trying to say that having `SomeCamelStringToCombine`, you want to *extract* all the listed combinations from that string, i.e. you're not *matching* anything, but building specific substrings of that string? – Andreas Jul 26 '17 at 16:37

1 Answer


You could build a fixed-size regex that finds combinations of up to a given number of words.
The one below uses five words' worth of capture groups, but you can extend it to any fixed maximum.

The regex is easy to generate programmatically.

When collecting the results, skip the capture groups that did not take part in the match (shown as empty on regex101, returned as null by Java).

Note that, for an input of at most five words, you can also skip groups 1-5 after the first match to avoid
duplicate single words.

(?=([A-Z][a-z]+)([A-Z][a-z]+)([A-Z][a-z]+)?([A-Z][a-z]+)?([A-Z][a-z]+)?)(?=(\1\2))(?=(\6\3)?)(?=(\7\4)?)(?=(\8\5)?)\1

https://regex101.com/r/ta9Qzq/1

 (?=
      ( [A-Z] [a-z]+ )              # (1), required Word 1
      ( [A-Z] [a-z]+ )              # (2), required Word 2
      ( [A-Z] [a-z]+ )?             # (3), optional Word 3
      ( [A-Z] [a-z]+ )?             # (4), optional Word 4
      ( [A-Z] [a-z]+ )?             # (5), optional Word 5
 )
 (?=
      ( \1 \2 )                     # (6), required Word 1,2
 )
 (?=
      ( \6 \3 )?                    # (7), optional Word 1,2,3
 )
 (?=
      ( \7 \4 )?                    # (8), optional Word 1,2,3,4
 )
 (?=
      ( \8 \5 )?                    # (9), optional Word 1,2,3,4,5
 )
 \1                            # Advance position by 1 word
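
To make the steps above concrete, here is a minimal Java sketch, not a definitive implementation. It generates the pattern for a chosen maximum word count, runs it with repeated `find()` calls, and collects every capture group that participated in a match. The class name `CamelRegexCombos` and the methods `buildRegex` and `combinations` are hypothetical names chosen for illustration; the sketch also assumes ASCII words of the form `[A-Z][a-z]+` and at least two words in the input, as the pattern itself does.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CamelRegexCombos {
    static final String WORD = "([A-Z][a-z]+)";

    // Builds the fixed-size pattern for up to maxWords words (assumes maxWords >= 2).
    static String buildRegex(int maxWords) {
        StringBuilder sb = new StringBuilder("(?=");
        for (int i = 1; i <= maxWords; i++) {
            sb.append(WORD);
            if (i > 2) {
                sb.append('?');              // words 3..maxWords are optional
            }
        }
        sb.append(')');
        sb.append("(?=(\\1\\2))");           // group maxWords+1: words 1,2
        for (int k = 3; k <= maxWords; k++) {
            // group maxWords+k-1: the previous run followed by word k
            sb.append("(?=(\\").append(maxWords + k - 2)
              .append("\\").append(k).append(")?)");
        }
        sb.append("\\1");                    // advance position by one word
        return sb.toString();
    }

    // Collects all non-empty captures from every match, dropping duplicates.
    static Set<String> combinations(String input, int maxWords) {
        Matcher m = Pattern.compile(buildRegex(maxWords)).matcher(input);
        Set<String> out = new LinkedHashSet<>();
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                String s = m.group(g);       // null if the group did not participate
                if (s != null && !s.isEmpty()) {
                    out.add(s);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combinations("SomeCamelStringToCombine", 5));
    }
}
```

With `maxWords` set to 5, `buildRegex` should reproduce the pattern shown above, and the sample input should yield the 15 substrings listed in the question. Using a `LinkedHashSet` removes the duplicate single words without special-casing groups 1-5. Combinations longer than `maxWords` words are not found.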
  • How is this an answer to "e.g. having `SomeCamelStringToCombine` it should match e.g. `CamelString`, but not `SomeString`"? – Andreas Jul 26 '17 at 16:36
  • @Andreas - In a global match, it produces the array `Some, Camel, String, To, Combine, SomeCamel, SomeCamelString, SomeCamelStringTo, SomeCamelStringToCombine, CamelString, CamelStringTo, CamelStringToCombine, StringTo, StringToCombine, ToCombine`. What's the problem? https://regex101.com/r/ta9Qzq/1 –  Jul 26 '17 at 16:39
  • Guess I didn't understand the question. Still not sure I do. However, you say *"it produces the array"*, and it doesn't. At least not without extra Java code to combine all the various captured groups from the repeated `find()` calls. You also say *"you could extend it to any size"*, but that isn't entirely true, since you can extend it to any *particular* maximum size, but not to support *any* (unlimited) size. – Andreas Jul 26 '17 at 16:46
  • @Andreas - Nah, I wouldn't put it to code just yet, would you? As for extending the size, it's limited by the number of capture groups that Java supports. I guess if it supports 25,000, you'd end up with a 2-3 MB regex. –  Jul 26 '17 at 16:53
  • @Andreas - Oh, also there is some sort of n-factorial operation going on here. It would be hard to imagine _unlimited_ size. –  Jul 26 '17 at 16:58
  • Yeah, rather than trying to do it all with regex, just split the input into the words, e.g. using `(?<=.)(?=\p{Lu})` *(split before uppercase letter that is not the first letter)*, and then it's a standard combinations problem, easy to implement in Java. [Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.](https://en.wikiquote.org/wiki/Jamie_Zawinski) – Andreas Jul 26 '17 at 17:13
  • @Andreas - Yeah, I would do it that way too. But some people are adamant. They want to see some cataclysmic, mind-bending, hand-wringing, head-exploding regex... Like this _[175,000 Word Dictionary](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/_ASCII_175,000_word_Mix_A-Z_Multi_Lined.txt)_. _Now they have 3 problems._ –  Jul 26 '17 at 17:32
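
The split-then-combine approach suggested in the comments can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `CamelSplitCombos` and its `combinations` method are hypothetical, the split pattern is the `(?<=.)(?=\p{Lu})` from the comment above, and every uppercase letter is assumed to start a new word (so a run such as `XML` would be split letter by letter).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class CamelSplitCombos {

    // Returns every run of consecutive camel-case words in the input.
    static List<String> combinations(String input) {
        List<String> out = new ArrayList<>();
        if (input.isEmpty()) {
            return out;                       // nothing to split
        }
        // Split before each uppercase letter that is not the first character.
        String[] words = input.split("(?<=.)(?=\\p{Lu})");
        for (int start = 0; start < words.length; start++) {
            StringBuilder run = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                run.append(words[end]);       // extend the run by one word
                out.add(run.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combinations("SomeCamelStringToCombine"));
    }
}
```

For n words this produces n(n+1)/2 substrings, so the five-word sample gives 15. Unlike the fixed-size regex, it has no upper limit on the number of words and needs no deduplication.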