What is the best way of splitting up a string by capital letters in C#?
Example:
HelloStackOverflow Users.How Are you doing?
Expected result:
Hello Stack Overflow Users. How are you doing?
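You can use a regex:
    static readonly Regex splitter = new Regex(@"(?<=[a-z])(?=[A-Z])|(?<=[,.?!])(?=\w)");
    var spacedOut = splitter.Replace(str, " ");
This uses a lookbehind and a lookahead to match the spot between a lowercase letter and a capital letter. Both assertions are zero-width, so nothing is consumed and existing whitespace is left alone.
It uses a second lookbehind to match the spot after punctuation, but only when a word character follows directly.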
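A regex built only from zero-width lookarounds can also be extended to keep acronyms together. The following is a minimal sketch, not the answerer's code: the class name `CapitalSplitter`, the method name `SpaceOut`, and the exact pattern are assumptions made for illustration, and the pattern only considers ASCII letters.

```csharp
using System.Text.RegularExpressions;

public static class CapitalSplitter
{
    // Hypothetical pattern with three zero-width alternatives:
    //   1. between a lowercase letter and an uppercase letter ("oS" in "HelloStack")
    //   2. between an uppercase letter and an uppercase letter that starts a word
    //      (the "I|Am" boundary in "IAmFromUSA"; "USA" itself stays intact)
    //   3. right after punctuation when a word character follows ("Users.How")
    private static readonly Regex Boundary = new Regex(
        @"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[,.?!])(?=\w)",
        RegexOptions.Compiled);

    public static string SpaceOut(string input)
    {
        // Every match is empty, so Replace only inserts spaces and never removes text.
        return Boundary.Replace(input, " ");
    }
}
```

With this sketch, `CapitalSplitter.SpaceOut("IAmFromUSA")` gives `I Am From USA`, and the question's example becomes `Hello Stack Overflow Users. How Are you doing?`. It does not lowercase `Are`, which the question's expected result shows in lowercase.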
It depends how you define "best".
Unless you want a trivial implementation (blindly inserting a space in front of every uppercase letter), I'd avoid regex and just write the few lines of code that do precisely what is needed: create a destination StringBuilder, loop over the characters of the string with foreach, copy each character across, and insert an extra space where appropriate. You just need a state variable that records whether the previous character was uppercase. This makes it easy to handle all the possible special cases (first character is uppercase, acronyms, characters following punctuation or whitespace, single words like "A", culture-sensitive handling, etc.).
Why wouldn't I use regex?
Firstly, if you want to handle all the special cases well, you'll probably need quite advanced regex skills, and the result will be an undecipherable "magic string" (difficult to read and maintain, as perfectly demonstrated by @Slaks IMHO: can you read and understand his regex in under 10 seconds?). A simple loop will be much easier to write, test, debug, read and upgrade unless you (and anyone else who might have to read or maintain your code in future) have been doing regexes for years.
Secondly, a loop through the characters is very simple. The regex will almost certainly be slower because of the higher level of generalisation it provides. This may or may not be an issue for you, but efficiency could be a significant factor when defining "best".
Thirdly, I'm an old dog and I don't see much point in using clever new tricks to solve problems that a simple for loop can handle :-) ... I often see programmers using "cool" obfuscated LINQ queries and Regexes in place of a simple 2-or-3-line loop, and it makes me think of the old adage "to a man with a hammer, everything looks like a nail". Regex, like all tools, has its place. And I'm not convinced this justifies anything that complex.
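The loop described in this answer could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the class name `LoopSplitter`, the method name `SpaceOut`, and the chosen punctuation set are hypothetical, and the sketch tracks the previous character itself as its state variable rather than a single "was uppercase" flag. It deliberately ignores acronyms and culture-sensitive casing.

```csharp
using System.Text;

public static class LoopSplitter
{
    public static string SpaceOut(string input)
    {
        var sb = new StringBuilder(input.Length + 8);
        char previous = '\0'; // state: last character copied; '\0' means "none yet"

        foreach (char c in input)
        {
            bool afterLowercase = char.IsLower(previous);
            bool afterPunctuation =
                previous == '.' || previous == ',' || previous == '?' || previous == '!';

            // Insert a space before an uppercase letter that directly follows
            // a lowercase letter or one of the punctuation marks above.
            if (char.IsUpper(c) && (afterLowercase || afterPunctuation))
            {
                sb.Append(' ');
            }

            sb.Append(c);
            previous = c;
        }

        return sb.ToString();
    }
}
```

For the question's example, `LoopSplitter.SpaceOut("HelloStackOverflow Users.How Are you doing?")` gives `Hello Stack Overflow Users. How Are you doing?`. Each additional special case becomes one more condition inside the loop, which is the maintainability argument this answer makes.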
I'm an old-school guy, so I would write it using StringBuilder because I do not speak regexish:
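    var sb = new StringBuilder(input.Length);
    int nextIndexToAdd = 0;
    for (int i = 1; i < input.Length; i++)
    {
        // Split before an uppercase letter that does not follow whitespace and that
        // either follows a non-uppercase character or is followed by a lowercase
        // letter (the first letter of a word right after an acronym).
        // IsLower is used for the next character so that "USA." is not split.
        if (char.IsUpper(input[i])
            && !char.IsWhiteSpace(input[i - 1])
            && (!char.IsUpper(input[i - 1]) || (i < input.Length - 1 && char.IsLower(input[i + 1]))))
        {
            // Copy the pending chunk without allocating a substring, then the space.
            sb.Append(input, nextIndexToAdd, i - nextIndexToAdd);
            sb.Append(' ');
            nextIndexToAdd = i;
        }
    }
    // Copy whatever remains after the last split point.
    sb.Append(input, nextIndexToAdd, input.Length - nextIndexToAdd);
    string result = sb.ToString();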
This handles both IAmFromUSA and HelloStack...