
I need to identify substrings found in a string such as:

"CityABCProcess Test" or "cityABCProcess Test"

to yield: [ "City" / "city", "ABC", "Process", "Test" ]

  1. The first letter of the first substring can be lowercase or uppercase.
  2. A run of consecutive uppercase letters forms one substring, which ends at a space or just before the uppercase letter that starts a capitalized word: "ABCProcess" -> "ABC", "ABC Process" -> "ABC".
  3. If an uppercase letter is followed by a lowercase letter, the substring is everything up to the next uppercase letter or space.

Can this be handled by regex, or should I convert my strings to a character array and check these cases manually with some indexing logic? Would a lambda solution work here? What is the best way to go about this?

Pipeline
  • This is going to be largely a matter of opinion, but IMO, when in doubt, don't use regex. It may be faster (and if speed is a huge concern, then it might be worth considering), but maintaining it is usually a headache. – user2366842 Jul 17 '15 at 15:31
  • [Now you have two problems](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/). – Phylogenesis Jul 17 '15 at 15:36
  • "\\p{Lu}+" would be starting point of your regex... But it likely will be easier to just write code by hand. (Note that string is already indexable sequence of characters)... http://stackoverflow.com/questions/18125738/is-there-a-better-way-to-create-acronym-from-upper-letters-in-c may be of help. – Alexei Levenkov Jul 17 '15 at 15:43
  • Implement a method that loops all characters in a `for-loop` and fills a `StringBuilder`. – Tim Schmelter Jul 17 '15 at 15:44
  • @user2366842: in most cases regex is the slowest option. – Tim Schmelter Jul 17 '15 at 15:46
  • Hmm...fair enough. Then again, I don't try and include a bunch of logic on the rare occasions I do end up using regex, so there might be something to that. – user2366842 Jul 17 '15 at 15:50

1 Answer


Pay no attention to the naysayers! Even something like this really isn't that complicated with RegEx. I believe this pattern should do the trick:

[A-Z][a-z]+|[A-Z]+\b|[A-Z]+(?=[A-Z])|[a-z]+

See here for a working demonstration. It's just a series of alternatives (ORs) tried in order from left to right, and the first one that matches at a given position wins. Here's the breakdown:

  • [A-Z][a-z]+ - Any word that starts with an uppercase letter followed by one or more lowercase letters
  • [A-Z]+\b - Any all-uppercase run that ends at a word boundary (this keeps the run's last uppercase letter, which the following option would exclude)
  • [A-Z]+(?=[A-Z]) - Any all-uppercase run that is followed by another uppercase letter; the lookahead leaves that letter, the first letter of the next word, out of the match
  • [a-z]+ - Any word that's all lowercase

For instance:

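using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "CityABCProcess TEST";

// Build the pattern one alternative at a time; they are tried left to right.
StringBuilder builder = new StringBuilder();
builder.Append("[A-Z][a-z]+");       // capitalized word: "City", "Process"
builder.Append("|");
builder.Append(@"[A-Z]+\b");         // uppercase run ending at a word boundary: "TEST"
builder.Append("|");
builder.Append("[A-Z]+(?=[A-Z])");   // uppercase run before the next capitalized word: "ABC"
builder.Append("|");
builder.Append("[a-z]+");            // all-lowercase word
foreach (Match m in Regex.Matches(input, builder.ToString()))
{
    Console.WriteLine(m.Value);
}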
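
To check the pattern against the exact strings from the question, it can be wrapped in a small helper. This is only a sketch: the name SplitWords is made up for illustration, and the top-level statements assume C# 9 or later. It also only recognizes ASCII letters, as the pattern itself does.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (the name is not from the answer): returns every match
// of the four-alternative pattern, in the order the matches occur in the input.
string[] SplitWords(string input)
{
    const string pattern = @"[A-Z][a-z]+|[A-Z]+\b|[A-Z]+(?=[A-Z])|[a-z]+";
    return Regex.Matches(input, pattern)
                .Cast<Match>()          // MatchCollection -> IEnumerable<Match>
                .Select(m => m.Value)
                .ToArray();
}

// The two inputs from the question, plus the "ABC Process" case from rule 2.
foreach (string sample in new[] { "CityABCProcess Test", "cityABCProcess Test", "ABC Process" })
{
    Console.WriteLine($"{sample} -> {string.Join(", ", SplitWords(sample))}");
}
```

This should print "City, ABC, Process, Test" for the first input, "city, ABC, Process, Test" for the second, and "ABC, Process" for the third.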
Steven Doggart