40

I'm using the regex

System.Text.RegularExpressions.Regex.Replace(stringToSplit, "([A-Z])", " $1").Trim()

to split strings by capital letter, for example:

'MyNameIsSimon' becomes 'My Name Is Simon'

I find this incredibly useful when working with enumerations. What I would like to do is change it slightly so that strings are only split if the next letter is a lowercase letter, for example:

'USAToday' would become 'USA Today'

Can this be done?

EDIT: Thanks to all for responding. I may not have thought this through entirely: in some cases 'A' and 'I' would need to be ignored, but that is not possible (at least not in a meaningful way). In my case, though, the answers below do what I need. Thanks!

Simon
  • 1
    Hmmm... this might not be as simple as initially thought - what about a string like "TodayILiveInTheUSAWithSimon" - both current answers will fail for this. – Peter Boughton Jul 08 '09 at 13:05
  • Good point. I can probably work around that though in this instance. – Simon Jul 08 '09 at 13:15

7 Answers

56
((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))

or its Unicode-aware cousin

((?<=\p{Ll})\p{Lu}|\p{Lu}(?=\p{Ll}))

when replaced globally with

" $1"

handles

TodayILiveInTheUSAWithSimon
USAToday
IAmSOOOBored

yielding

 Today I Live In The USA With Simon
USA Today
I Am SOOO Bored

In a second step you'd have to trim the string.
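As a minimal sketch of the two steps described above (replace, then trim), the ASCII pattern can be wrapped in a small helper. The class and method names here are hypothetical and only serve the illustration; swap in the `\p{Lu}` / `\p{Ll}` pattern if non-ASCII letters matter.

```csharp
using System.Text.RegularExpressions;

public static class CamelCaseSplitter
{
    // Matches an uppercase letter that either follows a lowercase letter
    // or is itself followed by one (the ASCII pattern from this answer).
    private static readonly Regex WordStart =
        new Regex("((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))");

    // Hypothetical helper: step 1 inserts a space before each match,
    // step 2 trims the leading space that a match at index 0 produces.
    public static string SplitWords(string value)
    {
        return WordStart.Replace(value, " $1").Trim();
    }
}
```

With this sketch, `CamelCaseSplitter.SplitWords("USAToday")` gives `USA Today` and `CamelCaseSplitter.SplitWords("IAmSOOOBored")` gives `I Am SOOO Bored`.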

Tomalak
  • Sorry, you lost me a bit! Like this: Replace(stringToSplit, "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " \1") ? – Simon Jul 08 '09 at 13:33
  • The `\1` means back-reference #1. In .NET regexes, this is expressed as `$1`. Other than that, your statement seems correct. – Tomalak Jul 08 '09 at 13:47
  • I've edited the answer so it uses the .NET style back-reference. – Tomalak Jul 08 '09 at 14:01
  • 5
    `([A-Z])(?<=[a-z]\1|[A-Za-z]\1(?=[a-z]))` doesn't add the space at the beginning because it can never match the first letter. :) – Alan Moore Dec 19 '09 at 05:18
  • 7
    Converted to string extension method: `public static string SeperateCamelCase(this string value) { return Regex.Replace(value, "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1"); }` – Tr1stan Feb 12 '11 at 13:39
  • @Tr1stan - your extension method needs a Trim() I believe. – Mike Cole May 09 '13 at 22:20
  • It's also "Sep*a*rate" – Roger Lipscombe Jul 15 '13 at 15:22
  • 2
    To address the need for a trim may I suggest: ((?<=[a-z])[A-Z]|(?<!^)[A-Z](?=[a-z])) – Phil Jul 28 '14 at 08:55
  • 2
    A unicode-aware version of what @AlanMoore posted, no `.Trim()` call needed since it doesn't match the first letter: `@"(\p{Lu})(?<=\p{Ll}\1|(\p{Lu}|\p{Ll})\1(?=\p{Ll}))"` – johnnyRose Mar 07 '18 at 18:26
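The trim-free variant suggested in the comments can be sketched as follows. It assumes the pattern exactly as Phil posted it; the class and method names are hypothetical. The `(?<!^)` guard stops the second alternative from matching at the start of the input, and the first alternative cannot match there either because its lookbehind needs a preceding lowercase letter.

```csharp
using System.Text.RegularExpressions;

public static class CamelCaseNoTrim
{
    // Phil's variant: (?<!^) rejects an uppercase letter at index 0,
    // so no leading space is ever inserted and no Trim() is needed.
    private const string Pattern = "((?<=[a-z])[A-Z]|(?<!^)[A-Z](?=[a-z]))";

    // Hypothetical helper name, used only for this illustration.
    public static string SplitWords(string value)
    {
        return Regex.Replace(value, Pattern, " $1");
    }
}
```

For example, `CamelCaseNoTrim.SplitWords("MyNameIsSimon")` gives `My Name Is Simon` without a leading space.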
14

Match any uppercase character that is not followed by an uppercase character:

Replace(string, "([A-Z])(?![A-Z])", " $1")

Edit:

I just noticed that you're using this for enumerations. I really don't encourage using string representations of enumerations like this, and the problem at hand is a good example of why. Have a look at this instead: http://www.refactoring.com/catalog/replaceTypeCodeWithClass.html

David Hedlund
  • 2
    That doesn't handle "I", i.e. "IAmBored" will not be split as "I Am Bored" as I assume the OP would expect. – Brian Rasmussen Jul 08 '09 at 13:16
  • i think you're mistaken. try this javascript for yourself: alert("IAmBored".replace(/([A-Z])(?![A-Z])/g, " $1")); it will match "A" and "B" as both are not followed by an uppercase character, and be replaced into " A" and " B" respectively – David Hedlund Jul 08 '09 at 13:52
  • (although i just realized that you're just mistaken with your choice of example, the general point is still accurate, for when the "I" is in the middle of a sentence) – David Hedlund Jul 08 '09 at 13:57
  • It also inserts a space before the "A" in "BornInTheUSA". – Alan Moore Dec 19 '09 at 10:43
  • This doesn't work. "aB" and "aBB" won't be split at all, when I'd expect "a B" and "a BB" respectively. The split sequence should actually be "upper followed by lower, not preceded by upper". – Triynko Dec 10 '20 at 19:37
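To make the cases raised in these comments easy to check, here is a small sketch that applies this answer's pattern through `Regex.Replace` and trims the result. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NotFollowedByUpper
{
    // The pattern from this answer: an uppercase letter that is
    // not immediately followed by another uppercase letter.
    private const string Pattern = "([A-Z])(?![A-Z])";

    // Hypothetical helper name, used only for this illustration.
    public static string SplitWords(string value)
    {
        return Regex.Replace(value, Pattern, " $1").Trim();
    }
}
```

`NotFollowedByUpper.SplitWords("USAToday")` gives `USA Today` and `"IAmBored"` gives `I Am Bored`, but `"TodayILive"` gives `TodayI Live` and `"BornInTheUSA"` gives `Born In The US A`, which are the failure cases pointed out above.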
2

I hope this helps with splitting a string by its capital letters, and much more. You can try Humanizer, a free NuGet package. It will save you further trouble with letters, sentences, numbers, quantities and much more, in many languages. Check it out at: https://www.nuget.org/packages/Humanizer/

Gabriel Marius Popescu
1

You might think about changing the enumerations; MS coding guidelines suggest Pascal-casing acronyms as though they were words: XmlDocument, HtmlWriter, etc. Two-letter acronyms don't follow this rule, though: System.IO.

So you should be using UsaToday, and your problem will disappear.

Steve Cooper
  • While I'm totally with you in general, this does not really solve the problem. If he'd written UsaToday, the split (i.e. human-readable) string would be "Usa Today", which is kind of strange since it's always written USA. Therefore I can understand the desire to retain capitalization. On the other hand, if one wanted to show enum names to users, one should go with another solution (I tend to have string resources like EnumName_ValueName, so the keys can be easily generated in code, are searchable in the resource file, and can be easily localized). – OregonGhost Jul 08 '09 at 14:23
0

Tomalak's expression worked for me, but not with the built-in Replace function. Regex.Replace(), however, did work.

For i As Integer = 0 To names.Length - 1
  'Worked
  names(i) = Regex.Replace(names(i), "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").TrimStart()

  ' Didn't work: VB's built-in Replace does literal text replacement, not regex matching
  'names(i) = Replace(names(i), "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " $1").TrimStart()
Next

BTW, I'm using this to split the words in enumeration names for display in the UI and it works beautifully.

Craig Boland
0

My version that also handles simple arithmetic expressions:

private string InjectSpaces(string s)
{
    var patterns = new string[] {
        @"(?<=[^A-Z,&])[A-Z]",          // match capital preceded by any character other than a capital, comma, or ampersand
        @"(?<=[A-Z])[A-Z](?=[a-z])",    // match capital preceded by capital and followed by lowercase letter
        @"[\+\-\*\/\=]",                // match arithmetic operators
        @"(?<=[\+\-\*\/\=])[0-9,\(]"    // match 0-9 or open paren preceded by arithmetic operator
    };
    var pattern = $"({string.Join("|", patterns)})";
    return Regex.Replace(s, pattern, " $1");
}
Casey Chester
-1

Note: I didn't read the question carefully enough. For USAToday this returns only "Today", so this answer isn't the right one.

    public static List<string> SplitOnCamelCase(string text)
    {
        List<string> list = new List<string> ();
        Regex regex = new Regex(@"(\p{Lu}\p{Ll}+)");
        foreach (Match match in regex.Matches(text))
        {
            list.Add (match.Value);
        }
        return list;
    }

This matches "WakeOnBoot" as "Wake", "On", "Boot" but returns nothing for all-caps strings such as NMI or TLA.
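One possible way to adapt the match-based approach so that acronym runs are kept is sketched below. This is not the pattern from the answer: the extra alternative `\p{Lu}+(?!\p{Ll})` and the class name are assumptions added for illustration. Text before the first uppercase letter, digits, and other non-letter characters are still dropped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CamelCaseWords
{
    // First alternative (an assumption, not from the answer): a run of
    // uppercase letters that is not followed by a lowercase letter, so the
    // run gives back its last letter when that letter starts a new word.
    // Second alternative: the answer's original "Upper + lowercase run".
    private static readonly Regex Word =
        new Regex(@"\p{Lu}+(?!\p{Ll})|\p{Lu}\p{Ll}+");

    public static List<string> Split(string text)
    {
        var words = new List<string>();
        foreach (Match match in Word.Matches(text))
        {
            words.Add(match.Value);
        }
        return words;
    }
}
```

With this sketch, `CamelCaseWords.Split("USAToday")` yields `USA` and `Today`, and `CamelCaseWords.Split("NMI")` yields the single item `NMI`.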