13

What regular expression can be used to make the following conversions?

City -> CITY
FirstName -> FIRST_NAME
DOB -> DOB
PATId -> PAT_ID
RoomNO -> ROOM_NO

The following almost works - it just adds an extra underscore to the beginning of the word:

var rgx = @"(?x)( [A-Z][a-z,0-9]+ | [A-Z]+(?![a-z]) )";

var tests = new string[] { "City",
                           "FirstName",
                           "DOB",
                           "PATId",
                           "RoomNO"};

foreach (var test in tests)
    Console.WriteLine("{0} -> {1}", test, 
                       Regex.Replace(test, rgx, "_$0").ToUpper());


// output:
// City -> _CITY
// FirstName -> _FIRST_NAME
// DOB -> _DOB
// PATId -> _PAT_ID
// RoomNO -> _ROOM_NO
MCS

6 Answers

17

Following on from John M Gant's idea of adding underscores and then capitalizing, I think this regular expression should work:

([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])

replacing with:

$1$3_$2$4

You can name the capture groups to make the replacement string a little nicer to read. In any single match only one of $1 and $3 has a value, and the same goes for $2 and $4, because each pair belongs to a different alternative. The general idea is to add an underscore when:

  • Two capital letters are followed by a lowercase letter: the underscore goes between the two capitals. (PATId -> PAT_Id)
  • A lowercase letter or digit is followed by a capital letter: the underscore goes between the two. (RoomNO -> Room_NO and FirstName -> First_Name)
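
As a rough sketch of how the two rules above play out in C#, the pattern and replacement string can be dropped straight into Regex.Replace. The class name, method names and named groups below are made up for illustration, and ToUpperInvariant is my own choice to keep the result independent of the current culture; plain ToUpper would also do for ASCII input in most cultures.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SnakeCaseDemo
{
    // Numbered groups, exactly as given in the answer.
    private const string Pattern = @"([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])";

    // The same pattern with named groups; the names are illustrative only.
    private const string NamedPattern =
        @"(?<acronymEnd>[A-Z])(?<wordStart>[A-Z][a-z])|(?<lower>[a-z0-9])(?<upper>[A-Z])";

    public static string ToUpperSnake(string input)
    {
        // Groups from the alternative that did not match expand to an empty string.
        return Regex.Replace(input, Pattern, "$1$3_$2$4").ToUpperInvariant();
    }

    public static string ToUpperSnakeNamed(string input)
    {
        return Regex.Replace(input, NamedPattern,
            "${acronymEnd}${lower}_${wordStart}${upper}").ToUpperInvariant();
    }

    public static void Main()
    {
        foreach (var test in new[] { "City", "FirstName", "DOB", "PATId", "RoomNO" })
            Console.WriteLine("{0} -> {1}", test, ToUpperSnake(test));
    }
}
```

Both methods should give the same result; the named version only changes how the replacement string reads.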

Hope this helps.

John McDonald
  • +1, you did a better job than I of decomposing the examples into rules. I do prefer the \p syntax for better internationalization, though the examples in the question use the [A-Z] syntax. – John M Gant Dec 22 '10 at 19:34
11

I suggest a simple Regex to insert the underscore, and then string.ToUpper() to convert to uppercase.

Regex.Replace(test, @"(\p{Ll})(\p{Lu})", "$1_$2").ToUpper()

It's two operations instead of one, but to me it's much easier to read than one big complicated regex replace.

John M Gant
  • Nice and concise, but it doesn't handle multiple uppercase letters in a row properly. For example, it converts PATId to PATID instead of PAT_ID. – MCS Dec 22 '10 at 16:22
  • Yeah, I see what you mean. `@"(\p{L})(\p{Lu})(\p{Ll})", "$1_$2$3"` fixes that, but it doesn't work on RoomNO. – John M Gant Dec 22 '10 at 16:27
  • (For what it's worth, neither PATId nor RoomNO is really what I'd normally consider CamelCase, but you did specify them in your question. Anyway, I'll leave this here in case it's helpful.) – John M Gant Dec 22 '10 at 16:32
2

I can probably come up with a regex that will do it... but I believe a transformative regex may not be the right answer. I suggest you take what you already have and just chop the first character (the leading underscore) off the output. The CPU time is probably going to be the same or less that way, and your coding time inconsequential.

Try: (?x)(.)( [A-Z][a-z,0-9]+ | [A-Z]+(?![a-z]) ) and change your code to output $0_$1 instead of _$0. (This turned out to be a misguided, failed attempt at the very thing I had just advised against.)
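
The simpler suggestion, keeping the question's regex and dropping the leading underscore afterwards, could look roughly like the sketch below. The class and method names are hypothetical, the pattern is copied unchanged from the question, and ToUpperInvariant is my own substitution for ToUpper to avoid culture-specific casing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimLeadingUnderscoreDemo
{
    // The pattern from the question, unchanged.
    private const string Pattern = @"(?x)( [A-Z][a-z,0-9]+ | [A-Z]+(?![a-z]) )";

    public static string ToUpperSnake(string input)
    {
        // Step 1: prefix every word found by the regex with an underscore.
        string withUnderscores = Regex.Replace(input, Pattern, "_$0");

        // Step 2: a word at the very start gets an underscore too; remove just that one.
        if (withUnderscores.StartsWith("_", StringComparison.Ordinal))
            withUnderscores = withUnderscores.Substring(1);

        return withUnderscores.ToUpperInvariant();
    }

    public static void Main()
    {
        foreach (var test in new[] { "City", "FirstName", "DOB", "PATId", "RoomNO" })
            Console.WriteLine("{0} -> {1}", test, ToUpperSnake(test));
    }
}
```

Checking for the underscore before removing it means an input that starts with a lowercase letter, which gets no leading underscore, is left intact.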

Jeff Ferland
  • That converts DOB -> D_OB and PATId -> P_ATID. – MCS Dec 22 '10 at 16:26
  • Indeed, it fails. Spending a little more time on it, I take back my statement about how I can probably come up with a single regex to do it. This is a two step problem. I suggest either doing it with two regexes as some have shown in other answers, or removing the underscore at the start as the second step. Trying to make this whole process go with one regex is shoehorning. – Jeff Ferland Dec 22 '10 at 19:16
1

I realize this is an old question, but it is still something that comes up often, so I have decided to share my own approach to it.

Instead of trying to do it with replacements, the idea is to find all “words” in the string and then convert them to upper case and join:

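// Requires: using System.Linq; using System.Text.RegularExpressions;
var tests = new string[] { "City",
                           "FirstName",
                           "DOB",
                           "PATId",
                           "RoomNO"};

// Group 1 repeats once per word, so its Captures collection holds every word in order.
var wordRegex = new Regex(@"^(\p{Lu}(?:\p{Lu}*|[\p{Ll}\d]*))*$");

foreach (var test in tests)
    Console.WriteLine("{0} -> {1}", test,
                      String.Join("_", wordRegex.Match(test)
                                                .Groups[1]
                                                .Captures
                                                .Cast<Capture>()
                                                .Select(c => c.Value.ToUpper())));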

Not terribly concise, but it lets you concentrate on defining exactly what a “word” is, instead of struggling with anchors, separators and whatnot. In this case I've defined a word as an uppercase letter followed by either a sequence of uppercase letters or a mix of lowercase letters and digits. If I wanted sequences of digits to be separate words, "^(\p{Lu}(?:\p{Lu}*|\p{Ll}*)|\d+)*$" would do the trick. If I wanted digits to stay attached to a preceding uppercase word as well, I'd use "^(\p{Lu}(?:[\p{Lu}\d]*|[\p{Ll}\d]*))*$".

Sergei Tachenov
1

Seems like Rails does it using more than one regular expression.

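// tests is the string array from the question.
// First pass: split a run of capitals from a following capitalized word (PATId -> PAT_Id).
var rgx = @"([A-Z]+)([A-Z][a-z])";
// Second pass: split a lowercase letter or digit from a following capital (RoomNO -> Room_NO).
var rgx2 = @"([a-z\d])([A-Z])";

foreach (var test in tests)
{
    var result = Regex.Replace(test, rgx, "$1_$2");
    result = Regex.Replace(result, rgx2, "$1_$2");
    result = result.ToUpper();
    Console.WriteLine("{0} -> {1}", test, result);
}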
MCS
0

There is no JavaScript answer here, so I may as well add one.

( This is using the regex from @John McDonald )

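var text = "fooBar barFoo";
// Insert an underscore at each word boundary found by the regex.
var newText = text.replace(/([A-Z])([A-Z][a-z])|([a-z0-9])([A-Z])/g, "$1$3_$2$4");
// Lower-cased here; use toUpperCase() for the upper-case format asked for in the question.
newText.toLowerCase(); // "foo_bar bar_foo"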
Vetras