Regular expression, split string by capital letter but ignore TLA

Question

I'm using the regex

System.Text.RegularExpressions.Regex.Replace(stringToSplit, "([A-Z])", " $1").Trim()

to split strings by capital letter, for example:

'MyNameIsSimon' becomes 'My Name Is Simon'

I find this incredibly useful when working with enumerations. What I would like to do is change it slightly so that strings are only split if the next letter is a lowercase letter, for example:

'USAToday' would become 'USA Today'

Can this be done?

EDIT: Thanks to all for responding. I may not have entirely thought this through: in some cases 'A' and 'I' would need to be ignored, but this is not possible (at least not in a meaningful way). In my case, though, the answers below do what I need. Thanks!

Hmmm... this might not be as simple as initially thought - what about a string like "TodayILiveInTheUSAWithSimon" - both current answers will fail for this. — Peter Boughton, Jul 08 '09 at 13:05
Good point. I can probably work around that though in this instance. — Simon, Jul 08 '09 at 13:15

Tomalak · Accepted Answer · 2014-04-08T08:53:21.513

56

((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))

or its Unicode-aware cousin

((?<=\p{Ll})\p{Lu}|\p{Lu}(?=\p{Ll}))

when replaced globally with

" $1"

handles

TodayILiveInTheUSAWithSimon
USAToday
IAmSOOOBored

yielding

 Today I Live In The USA With Simon
USA Today
I Am SOOO Bored

In a second step you'd have to trim the string.
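As a minimal sketch of the two steps together (replace, then trim), the Unicode-aware pattern above could be wrapped like this. The class and method names are hypothetical, chosen only for illustration; the pattern and the sample inputs are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CamelCaseSplitter
{
    // Unicode-aware pattern from the answer: an uppercase letter that either
    // follows a lowercase letter or is itself followed by a lowercase letter.
    private static readonly Regex Boundary =
        new Regex(@"((?<=\p{Ll})\p{Lu}|\p{Lu}(?=\p{Ll}))");

    // Hypothetical helper: step 1 inserts a space before each match,
    // step 2 trims the leading space produced when the first letter matches.
    public static string SplitCamelCase(string value)
    {
        return Boundary.Replace(value, " $1").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(SplitCamelCase("TodayILiveInTheUSAWithSimon")); // Today I Live In The USA With Simon
        Console.WriteLine(SplitCamelCase("USAToday"));                    // USA Today
        Console.WriteLine(SplitCamelCase("IAmSOOOBored"));                // I Am SOOO Bored
    }
}
```

Holding the Regex in a static field is just a convenience so the pattern is built once; calling the static Regex.Replace with the pattern string should behave the same way.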

edited Apr 08 '14 at 08:53

answered Jul 08 '09 at 13:21

Tomalak


Sorry, you lost me a bit! Like this: Replace(stringToSplit, "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " \1") ? – Simon Jul 08 '09 at 13:33
The `\1` means back-reference #1. In .NET regexes, this is expressed as `$1`. Other than that, your statement seems correct. – Tomalak Jul 08 '09 at 13:47
I've edited the answer so it uses the .NET style back-reference. – Tomalak Jul 08 '09 at 14:01
`([A-Z])(?<=[a-z]\1|[A-Za-z]\1(?=[a-z]))` doesn't add the space at the beginning because it can never match the first letter. :) – Alan Moore Dec 19 '09 at 05:18
Converted to string extension method: `public static string SeperateCamelCase(this string value) { return Regex.Replace(value, "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1"); }` – Tr1stan Feb 12 '11 at 13:39
@Tr1stan - your extension method needs a Trim() I believe. – Mike Cole May 09 '13 at 22:20
It's also "Sep*a*rate" – Roger Lipscombe Jul 15 '13 at 15:22
To address the need for a trim may I suggest: ((?<=[a-z])[A-Z]|(?<!^)[A-Z](?=[a-z])) – Phil Jul 28 '14 at 08:55
A unicode-aware version of what @AlanMoore posted, no `.Trim()` call needed since it doesn't match the first letter: `@"(\p{Lu})(?<=\p{Ll}\1|(\p{Lu}|\p{Ll})\1(?=\p{Ll}))"` – johnnyRose Mar 07 '18 at 18:26
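For anyone weighing the no-trim variants in these comments, here is a small sketch using the `(?<!^)` pattern suggested by Phil above. The class and method names are hypothetical; only the pattern comes from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoTrimSplitter
{
    // Pattern from Phil's comment: the second alternative is guarded by (?<!^),
    // so an uppercase letter at the very start of the string is never matched.
    // The first alternative needs a preceding lowercase letter, so it cannot
    // match at the start either.
    private const string Pattern = @"((?<=[a-z])[A-Z]|(?<!^)[A-Z](?=[a-z]))";

    // Hypothetical helper: no Trim() call, because no leading space is inserted.
    public static string SplitCamelCase(string value)
    {
        return Regex.Replace(value, Pattern, " $1");
    }

    public static void Main()
    {
        Console.WriteLine(SplitCamelCase("TodayILiveInTheUSAWithSimon")); // Today I Live In The USA With Simon
        Console.WriteLine(SplitCamelCase("USAToday"));                    // USA Today
    }
}
```

This version is ASCII-only, like the pattern it is based on; swapping `[a-z]` and `[A-Z]` for `\p{Ll}` and `\p{Lu}` would presumably give the Unicode-aware equivalent.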

David Hedlund · Answer 2 · 2009-07-08T14:02:25.700

14

Any uppercase character that is not followed by another uppercase character:

Regex.Replace(stringToSplit, "([A-Z])(?![A-Z])", " $1")

Edit:

I just noticed that you're using this for enumerations. I really don't encourage using string representations of enumerations like this, and the problem at hand is a good reason why. Have a look at this instead: http://www.refactoring.com/catalog/replaceTypeCodeWithClass.html

edited Jul 08 '09 at 14:02

answered Jul 08 '09 at 13:00

David Hedlund


That doesn't handle "I", i.e. "IAmBored" will not be split as "I Am Bored" as I assume the OP would expect. – Brian Rasmussen Jul 08 '09 at 13:16
i think you're mistaken. try this javascript for yourself: alert("IAmBored".replace(/([A-Z])(?![A-Z])/g, " $1")); it will match "A" and "B" as both are not followed by an uppercase character, and be replaced into " A" and " B" respectively – David Hedlund Jul 08 '09 at 13:52
(although i just realized that you're just mistaken with your choice of example, the general point is still accurate, for when the "I" is in the middle of a sentence) – David Hedlund Jul 08 '09 at 13:57
It also inserts a space before the "A" in "BornInTheUSA". – Alan Moore Dec 19 '09 at 10:43
This doesn't work. "aB" and "aBB" won't be split at all, when I'd expect "a B" and "a BB" respectively. The split sequence should actually be "upper followed by lower, not preceded by upper". – Triynko Dec 10 '20 at 19:37
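To make the difference discussed in these comments concrete, here is a small side-by-side sketch that runs this answer's negative-lookahead pattern and the accepted answer's pattern over the same inputs. The class and method names are hypothetical; both patterns are taken from the thread as written.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // This answer: an uppercase letter not followed by another uppercase letter.
    public const string NotFollowedByUpper = @"([A-Z])(?![A-Z])";

    // Accepted answer: an uppercase letter after a lowercase one,
    // or an uppercase letter followed by a lowercase one.
    public const string LowerUpperBoundary = @"((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))";

    // Hypothetical helper: insert a space before each match, then trim.
    public static string Split(string value, string pattern)
    {
        return Regex.Replace(value, pattern, " $1").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Split("USAToday", NotFollowedByUpper));     // USA Today
        Console.WriteLine(Split("BornInTheUSA", NotFollowedByUpper)); // Born In TheUS A
        Console.WriteLine(Split("BornInTheUSA", LowerUpperBoundary)); // Born In The USA
    }
}
```

The trailing "A" in "BornInTheUSA" is not followed by an uppercase letter, so the negative lookahead treats it as the start of a word, which is the case Alan Moore points out.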

score 2 · Answer 3 · answered Jan 12 '18 at 19:08

I hope this helps with splitting a string by its capital letters, and much more. You can try Humanizer, a free NuGet package. It will save you trouble with letters, sentences, numbers, quantities and much more, in many languages. Check it out at: https://www.nuget.org/packages/Humanizer/

score 1 · Answer 4 · answered Jul 08 '09 at 13:03

You might think about changing the enumerations; MS coding guidelines suggest Pascal-casing acronyms as though they were words: XmlDocument, HtmlWriter, etc. Two-letter acronyms don't follow this rule, though, as in System.IO.

So you should be using UsaToday, and your problem will disappear.

answered Jul 08 '09 at 13:03

Steve Cooper


While I'm totally with you in general, this does not really solve the problem. If he'd written UsaToday, this would result in the split (i.e. human-readable) string as "Usa Today", which is kind of strange since it's always written USA. Therefore I can understand the desire to retain capitalization. On the other hand, if one wanted to show enum names to users, one should go with another solution (I tend to have string resources like EnumName_ValueName, so the key can be easily generated in code, are searchable in the resource file and can be easily localized). – OregonGhost Jul 08 '09 at 14:23

score 0 · Answer 5 · answered Dec 19 '09 at 00:34

Tomalak's expression worked for me, but not with VB's built-in Replace function, which does a literal substring replacement rather than a regex match. Regex.Replace(), however, did work.

For i As Integer = 0 To names.Length - 1
  'Worked
  names(i) = Regex.Replace(names(i), "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").TrimStart()

  ' Didn't work
  'names(i) = Replace(names(i), "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " $1").TrimStart()
Next

BTW, I'm using this to split the words in enumeration names for display in the UI and it works beautifully.

Casey Chester · Answer 6 · 2017-08-08T14:57:20.343

My version that also handles simple arithmetic expressions:

private string InjectSpaces(string s)
{
    var patterns = new string[] {
        @"(?<=[^A-Z,&])[A-Z]",          // match capital preceded by any character other than a capital, comma, or ampersand
        @"(?<=[A-Z])[A-Z](?=[a-z])",    // match capital preceded by capital and followed by lowercase letter
        @"[\+\-\*\/\=]",                // match arithmetic operators
        @"(?<=[\+\-\*\/\=])[0-9,\(]"    // match digit, comma, or open paren preceded by arithmetic operator
    };
    var pattern = $"({string.Join("|", patterns)})";
    return Regex.Replace(s, pattern, " $1");
}

MarijnStevens · Answer 7 · 2015-04-19T13:13:11.870

Note: I didn't read the question carefully enough. For USAToday this returns only "Today", so this answer isn't the right one.

    public static List<string> SplitOnCamelCase(string text)
    {
        List<string> list = new List<string> ();
        Regex regex = new Regex(@"(\p{Lu}\p{Ll}+)");
        foreach (Match match in regex.Matches(text))
        {
            list.Add (match.Value);
        }
        return list;
    }

This splits "WakeOnBoot" into "Wake", "On" and "Boot", and returns nothing for all-caps strings such as NMI or TLA.
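If a list of words is wanted and acronyms should survive, one possible variation (not the code from this answer, just a sketch) is to split at zero-width boundaries with Regex.Split instead of collecting matches. The class name and the pattern below are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CamelCaseWords
{
    // Assumed pattern: split between a lowercase and an uppercase letter, or
    // between two uppercase letters when the second one starts a new word
    // (that is, it is followed by a lowercase letter). Both alternatives are
    // zero-width, so no characters are consumed by the split.
    private const string Boundary =
        @"(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})";

    public static List<string> SplitOnCamelCase(string text)
    {
        return new List<string>(Regex.Split(text, Boundary));
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", SplitOnCamelCase("WakeOnBoot"))); // Wake On Boot
        Console.WriteLine(string.Join(" ", SplitOnCamelCase("USAToday")));   // USA Today
        Console.WriteLine(string.Join(" ", SplitOnCamelCase("NMI")));        // NMI
    }
}
```

Because the boundaries always have a letter on both sides, the split never fires at the start or end of the string, so no empty entries are produced for these inputs.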
