0

I have a string with multiple spaces at the beginning, in the middle and at the end: "        Humpty   Dumpty   sat  ".

I used a regular expression (https://stackoverflow.com/a/2932439/13136767) to remove the extra whitespace and replace it with group 1 (which is a single space).

String str = "        Humpty   Dumpty   sat  ";
str = str.replaceAll("^ +| +$|( )+", "$1");
System.out.println("[" + str + "]");

Expected Output:

[ Humpty Dumpty sat ]

Actual Output:

[Humpty Dumpty sat]

A replacement string is the text that each regular expression match is replaced with during a search-and-replace. The long run of spaces at the beginning of the String should have been replaced by a single space. Why did it not add an empty space, here, at the beginning of the String?

dipindashoff
  • Any particular reason why you want to leave one extra space at the start and end? – Tim Biegeleisen Mar 20 '21 at 14:17
  • My understanding is that all the matches should be replaced by group 1. So I expect a single space (group 1) at the start and end of the sentence (and in the middle). I could not understand why the spaces at the start and end did not get replaced by group 1, i.e. a single space. – dipindashoff Mar 21 '21 at 16:34

4 Answers

2

A simple solution can be replacing a sequence of multiple whitespace characters with a single whitespace character.

Demo:

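public class Main {
    public static void main(String[] args) {
        String str = "     Humpty   Dumpty   sat ";
        // Print the original string between markers so the outer spaces are visible
        System.out.println("->" + str + "<-");

        // Collapse every run of whitespace (spaces, tabs, newlines) into one space
        str = str.replaceAll("\\s+", " ");
        System.out.println("->" + str + "<-");
    }
}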

Output:

->     Humpty   Dumpty   sat <-
-> Humpty Dumpty sat <-
Arvind Kumar Avinash
2

Why did it not add an empty space, here, at the beginning of the String?

Because the regex you're using is specifically designed not to add spaces at the beginning or end of the string:

str.replaceAll("^ +| +$|( )+", "$1");

Here we have three alternatives: "^ +", " +$" and "( )+". All three alternatives match one or more spaces. The difference is that the first two only match at the beginning and end of the string respectively, and that only the third one contains a capturing group. So if the third one is matched, i.e. if the sequence of spaces is not at the beginning or end of the string, the value of $1 will be a space. Otherwise group 1 does not take part in the match and $1 will be empty.

The whole point of this is to not add spaces at the beginning or end. If you don't want this behaviour, you don't need any of this complexity. Just replace one or more spaces with a single space and that's it.
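To illustrate that last suggestion, here is a minimal sketch that collapses every run of spaces, including the leading and trailing ones, into a single space. The class name SingleSpace and the helper collapseSpaces are made up for this example and are not part of the question's code.

```java
class SingleSpace {
    // Hypothetical helper: replaces each run of one or more spaces with exactly one space.
    // No anchors and no capturing group are needed for this behaviour.
    static String collapseSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        String str = "        Humpty   Dumpty   sat  ";
        // Prints [ Humpty Dumpty sat ]
        System.out.println("[" + collapseSpaces(str) + "]");
    }
}
```

This gives the output the question expected. Note that " +" only matches the space character; use "\\s+" instead if tabs and line breaks should be collapsed as well.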

sepp2k
1

I don't know what your goal is here, but if you want to remove extra spaces only in between words, then I would suggest using lookarounds:

String input = "        Humpty   Dumpty   sat  ";
// Match a word followed by two or more spaces, but only when another word follows
String output = input.replaceAll("\\b(\\w+)[ ]{2,}(?=\\w)", "$1 ");
System.out.println("|" + input + "|");
System.out.println("|" + output + "|");

This prints:

|        Humpty   Dumpty   sat  |
|        Humpty Dumpty sat  |
Tim Biegeleisen
1

When replaceAll performs multiple replacements, any captures are only available if they matched during the current replacement. Captures from earlier or later matches can't be used.

This means that when the spaces at the beginning and end of the string are replaced, $1 isn't available since the ( )+ alternative wasn't the one that matched. $1 is only available in the middle of the string, when the non-anchored alternative matches.

We can see this in an even simpler example:

String str = "foobar";
System.out.println(str.replaceAll("(foo)|bar", "<$1>")); 

If $1 were remembered then we'd expect to see this output:

<foo><foo>

It's not, though. The actual output has a blank where bar used to be:

<foo><>

This shows that $1 is cleared after foo is matched and is empty when bar is replaced.
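The same thing can be observed on the original string by stepping through the matches with a Matcher and checking whether group 1 took part in each one. This is only a sketch for illustration; the class name GroupPerMatch and the method groupOneParticipated are invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPerMatch {
    // Hypothetical helper: for each match of the question's pattern, records
    // whether capturing group 1 participated in that particular match.
    static List<Boolean> groupOneParticipated(String s) {
        Matcher m = Pattern.compile("^ +| +$|( )+").matcher(s);
        List<Boolean> result = new ArrayList<>();
        while (m.find()) {
            // group(1) returns null when the group was not part of the current match
            result.add(m.group(1) != null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [false, true, true, false]
        System.out.println(groupOneParticipated("        Humpty   Dumpty   sat  "));
    }
}
```

The first and last matches come from the anchored alternatives, so group 1 is null there and replaceAll substitutes nothing for $1. Only the two matches in the middle carry a space in group 1.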

John Kugelman