Format a string using regex in Java

Question

Is there a way to format a string into a specific pattern using regex, or is StringBuilder plus substring a faster approach?

For example, say a phone number --> 1234567890 as input

And get output as --> (123) 456-7890

I saw that it is possible in this article: https://web.archive.org/web/20211020111604/https://www.4guysfromrolla.com/webtech/031302-1.shtml but the explanation given is in ASP. How do I do it in Java?

heh.... coming back after years... the NANP doesn't allow such a number (no area code can start with `1`) :-P — Code Jockey, May 12 '17 at 16:25

score 34 · Answer 1 · edited May 23 '17 at 12:02

Disclaimer

Since several answers have already addressed the greater efficiency of string builders, etc., I wanted to show you how it could be done with regex and address the benefits of using this approach.

One REGEX Solution

Using this matching regex (similar to Alan Moore's expression):

(.{3})(.{3})(.{4})

allows you to match precisely 10 characters into 3 groups, then use a replace expression that references those groups, with additional characters added:

($1) $2-$3

thus producing the replacement you requested. Of course, the . wildcard will also match punctuation and letters, which is a reason to use \d (written in a Java string as \\d) instead.

Why REGEX?

The potential advantage of a regex approach to something like this is the compression of "logic" to the string manipulation. Since all the "logic" can be compressed into a string of characters, rather than pre-compiled code, the regex matching and replacement strings can be stored in a database for easier manipulation, updating, or customization by an experienced user of the system. This makes the situation more complex on several levels, but allows considerably more flexibility for users.

With the other approaches (string manipulation), changing a formatting algorithm so that it will produce (555)123-4567 or 555.123.4567 instead of your specified (555) 123-4567 would essentially not be possible merely through the user interface. With the regex approach, the modification would be as simple as changing ($1) $2-$3 (in the database or similar store) into $1.$2.$3 or ($1)$2-$3 as appropriate.
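As a rough sketch of that idea, the replacement template can be treated as plain data and swapped without touching the formatting code. The class name `ReplacementTemplates`, the template keys, and the in-memory map (standing in for a database or config file) are all assumptions made for illustration; the matching expression uses `\d` rather than the `.` wildcard.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplacementTemplates {
    // Matching expression from the answer, with \d in place of the . wildcard.
    static final String MATCH = "(\\d{3})(\\d{3})(\\d{4})";

    // Hypothetical stand-in for templates loaded from a database or config file.
    static Map<String, String> templates() {
        Map<String, String> t = new LinkedHashMap<>();
        t.put("spaced", "($1) $2-$3");
        t.put("compact", "($1)$2-$3");
        t.put("dotted", "$1.$2.$3");
        return t;
    }

    // Applies the named template; only the template string differs per format.
    static String format(String digits, String templateName) {
        return digits.replaceFirst(MATCH, templates().get(templateName));
    }

    public static void main(String[] args) {
        for (String name : templates().keySet()) {
            System.out.println(name + ": " + format("5551234567", name));
        }
    }
}
```

Changing the output format here means editing a stored string, not the `format` method.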

If you wanted to modify your system to accept "dirtier" input, which might include various attempts at formatting, such as 555-123.4567, and reformat it to something consistent, you could write a string-manipulation algorithm capable of this and recompile the application. With a regex solution, however, a system overhaul would not be necessary: merely change the parsing and replacement expressions like so (maybe a little complex for beginners to understand right away):

^\D*1?\D*([2-9])\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d).*$
($1$2$3) $4$5$6-$7$8$9$10

This would allow a significant "upgrade" in the program's ability, as shown in the following reformatting:

"Input"                       "Output"
----------------------------- --------------------------------
"1323-456-7890 540"           "(323) 456-7890"
"8648217634"                  "(864) 821-7634"
"453453453322"                "(453) 453-4533"
"@404-327-4532"               "(404) 327-4532"
"172830923423456"             "(728) 309-2342"
"jh345gjk26k65g3245"          "(345) 266-5324"
"jh3g24235h2g3j5h3"           "(324) 235-2353"
"12345678925x14"              "(234) 567-8925"
"+1 (322)485-9321"            "(322) 485-9321"
"804.555.1234"                "(804) 555-1234"
"08648217634"                 <no match or reformatting>

As you can see, it is very "tolerant" of input "formatting" and knows that 1 should be ignored at the beginning of the number and that 0 should cause an error because it is invalid - all stored in a single string.
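For Java specifically, the two expressions above could be applied roughly as follows. This is a minimal sketch: the class name `TolerantPhoneFormatter` and the choice to return `null` when the input does not match are assumptions, not part of the original answer. Note that Java reads `$10` as group 10 here because the pattern has ten capturing groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantPhoneFormatter {
    // Parsing expression from the answer, with backslashes doubled for a Java string.
    private static final Pattern PARSE = Pattern.compile(
        "^\\D*1?\\D*([2-9])\\D*(\\d)\\D*(\\d)\\D*(\\d)\\D*(\\d)"
        + "\\D*(\\d)\\D*(\\d)\\D*(\\d)\\D*(\\d)\\D*(\\d).*$");

    // Replacement expression from the answer, unchanged.
    private static final String REPLACEMENT = "($1$2$3) $4$5$6-$7$8$9$10";

    // Returns the reformatted number, or null when the input does not match
    // (returning null is an assumption of this sketch).
    static String format(String input) {
        Matcher m = PARSE.matcher(input);
        return m.matches() ? m.replaceFirst(REPLACEMENT) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "1323-456-7890 540", "jh345gjk26k65g3245",
            "+1 (322)485-9321", "08648217634"
        };
        for (String in : inputs) {
            System.out.println("\"" + in + "\" -> " + format(in));
        }
    }
}
```

Calling `String.replaceFirst` directly would also work, but it returns the input unchanged when nothing matches, so the explicit `matches()` check makes the "no match" case visible.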

The question comes down to performance vs. potential to customize. String manipulation is faster than regex, but future enhancements or customizations require a recompile rather than a simple alteration of a string. That said, there are things that can't be expressed very well with regex (or even in as readable a fashion as the above change), and some things that are not possible with regex at all.

TL;DR:

Regex allows storage of parsing algorithms into a relatively short string, which can be easily stored so as to be modifiable without recompiling. Simpler, more focused string manipulation functions are more efficient and can sometimes accomplish more than regex can. The key is to understand both tools and the requirements of the application and use the one most appropriate for the situation.

Nice explanation... you actually got the general idea why I asked the question. I always thought regex are faster and a _cleaner_ way of writing a code. But then again, in my case I'd rather go with Stringbuilder approach since the chances of my code formatting the string are very low. Nevertheless, great explanation. Thanks a lot. — Vrushank, Nov 20 '11 at 13:27
@VrushankDesai: if you feel your answer was provided, please select the best or most helpful answer by clicking the check/tick mark under the numbers at the left. — Code Jockey, Nov 21 '11 at 20:04
Just wanted to say thanks. I'm designing a web service for any potential client to use to build their own website... so I have no idea what kind of input I'm going to get. This is exactly what I needed. — Asaf, Jun 29 '12 at 00:20

score 17 · Accepted Answer · answered Nov 19 '11 at 20:08

Reach for a regex when the same thing cannot be done with substring, or is harder to do that way.

In your case it is better to just use StringBuilder and insert().

Assuming phone number length validation is in place (= 10 chars):

        String phoneNumber = "1234567890";
        StringBuilder sb = new StringBuilder(phoneNumber)
                                .insert(0,"(")
                                .insert(4,")")
                                .insert(8,"-");
        String output = sb.toString();
        System.out.println(output);

Output

(123)456-7890
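The output above has no space after the closing parenthesis, whereas the question asked for (123) 456-7890. A small variation on the same insert() approach would produce that; the class name `InsertWithSpace` is hypothetical, and the input is still assumed to be exactly 10 validated characters.

```java
class InsertWithSpace {
    // Assumes the caller has already validated that the input has 10 characters.
    static String format(String tenDigits) {
        return new StringBuilder(tenDigits)
            .insert(0, "(")    // "(1234567890"
            .insert(4, ") ")   // "(123) 4567890"
            .insert(9, "-")    // index shifts by one because of the added space
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(format("1234567890"));
    }
}
```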

score 12 · Answer 3 · answered Nov 19 '11 at 20:32

The same technique works in Java; you just have to adjust it to Java syntax and API:

s = s.replaceFirst("(\\d{3})(\\d{3})(\\d{4})", "($1) $2-$3");

I don't understand why you're asking about the faster approach, though. Have you tried something like this and experienced performance problems? You can almost certainly do this more efficiently with a StringBuilder, but in practical terms it's almost certainly not worth the effort.

Or were you talking about the time it would take to learn how to accomplish this with a regex relative to hand-coding it with a StringBuilder? That's kind of a moot point now, though. :D
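If the regex version ever did show up in a profile, one low-effort step before rewriting it by hand would be to compile the pattern once and reuse it, since `String.replaceFirst` compiles its regex on every call. This is only a sketch; the class name `PrecompiledFormat` is an assumption.

```java
import java.util.regex.Pattern;

class PrecompiledFormat {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern PHONE =
        Pattern.compile("(\\d{3})(\\d{3})(\\d{4})");

    static String format(String s) {
        return PHONE.matcher(s).replaceFirst("($1) $2-$3");
    }

    public static void main(String[] args) {
        System.out.println(format("1234567890"));
    }
}
```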

Alan Moore

I just asked because I thought using substring() would be performance intensive and would be a layman approach... :) – Vrushank Nov 20 '11 at 13:21
Doing string concatenations in a loop is the only absolute no-no. Choosing among other techniques is a matter of personal preference, unless testing exposes a real performance problem. – Alan Moore Nov 20 '11 at 13:55

score 2 · Answer 4 · answered Nov 19 '11 at 19:59

I would use a combination of Java's String.format() method and String.substring().
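One way that combination might look, as a minimal sketch: the class name `FormatWithSubstring` is hypothetical, and the input is assumed to be an already validated 10-character string.

```java
class FormatWithSubstring {
    // Assumes a validated 10-character input; substring() throws on shorter strings.
    static String format(String tenDigits) {
        return String.format("(%s) %s-%s",
            tenDigits.substring(0, 3),   // area code
            tenDigits.substring(3, 6),   // next three characters
            tenDigits.substring(6));     // remaining four characters
    }

    public static void main(String[] args) {
        System.out.println(format("1234567890"));
    }
}
```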

Lucas

score 1 · Answer 5 · answered Nov 19 '11 at 19:57

A regular expression matcher with groups is really nothing else but a number of String containers, plus a lot of RE matching code. (You can actually look at the source code and see for yourself.) No way is this cheaper than just using substring() yourself, especially with a fixed offset as in your case.

Kilian Foth

+1 for demystifying REs. In more complicated situations, a regex can be the way to go, but here the straight substring() code is right. – Daniel Fischer Nov 19 '11 at 20:19

score 0 · Answer 6 · answered Nov 19 '11 at 19:50

StringBuilder with substring will be faster, but not always the simplest/best approach. In this case I would just use substring.

String num = "1234567890";
String formatted = "(" + num.substring(0,3) + ") "
     + num.substring(3,6) + "-" + num.substring(6);

Peter Lawrey

I downvoted your answer because it is a bad way to do string concatenation. – ilalex Nov 19 '11 at 20:15
If I'm not mistaken, the compiler puts out the same code you'd do with StringBuilder, as long as it's one expression like this. So do I upvote for the answer being better than you thought, or downvote because it was? ;-) – Ed Staub Nov 19 '11 at 20:19
@ilya, if you think this is less efficient than StringBuilder, you're mistaken. It would be less efficient if the "+" signs were in more than one statement - but they're not. See e.g. http://stackoverflow.com/questions/1532461/stringbuilder-vs-string-concatenation-in-tostring-in-java – Ed Staub Nov 19 '11 at 20:36
