-1

Every time I need something with regular expressions I have such a hard time...

Now I need to clean up some fuzzy text. In a very old database, a certain system didn't allow users to format their text, so users got creative, entering expressions like:

S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T

My question is, how can I turn that text into:

SOME TEXT I WOULD LIKE TO HIGHLIGHT

with regular expressions in Java.

Sorry about the silly question, but I've spent a lot more time trying to figure this out than I was supposed to.

5 Answers

3

This regex removes the spaces between letters and leaves a single space between words.

String r = "S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T";
System.out.println(r.replaceAll("(\\s){2,}|\\s", "$1"));

Output:

SOME TEXT I WOULD LIKE TO HIGHLIGHT

The idea is that the first alternative, (\s){2,}, matches a run of two or more consecutive whitespace characters and captures one of them in group 1, while the second alternative, \s, matches any remaining single whitespace character and leaves group 1 empty. Replacing each match with the contents of group 1 therefore gives the desired output.
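To see which alternative fires for each match, a small sketch along these lines may help. The class name GroupOneDemo and the helper describe are made up for illustration; only the pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class GroupOneDemo {
    private static final Pattern SPACES = Pattern.compile("(\\s){2,}|\\s");

    // For every match, records "<matched length>:<state of group 1>".
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SPACES.matcher(text);
        while (m.find()) {
            String g1 = m.group(1); // null when the second alternative matched
            out.add(m.group().length() + ":" + (g1 == null ? "none" : "space"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Single spaces between letters, three spaces between the two words.
        String sample = "S O   T O";
        System.out.println(describe(sample)); // [1:none, 3:space, 1:none]
        System.out.println(sample.replaceAll("(\\s){2,}|\\s", "$1")); // SO TO
    }
}
```

A lone space is matched by the second alternative, so group 1 contributes nothing to the replacement; a run of spaces is matched by the first alternative, so one space survives.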


Avinash Raj
  • It really works, but I couldn't understand it... As I read it, the expression says: replace [2 or more white-spaces OR 1 white-space] with [white-space], since group 1 is always a white-space... Did I misunderstand? – Alex Gouvêa Vasconcelos Mar 06 '15 at 18:21
  • @AlexGouvêaVasconcelos The idea is to let the regex first test whether there is more than one space, and if so, replace the run with the first one. If there is only one space, group `1` will be empty, so you replace this single space with nothing (you remove it). – Pshemo Mar 06 '15 at 18:26
  • I see... I hadn't realized that the quantifier following the group creates a condition for its value, and that because of this the '$1' that represents the group could be empty... – Alex Gouvêa Vasconcelos Mar 06 '15 at 18:40
  • It is similar to `(foo)+` and `(foo+)`. For data like `ZfoofooZ` both regexes will match the `foofoo` part, but in the first case, since the group doesn't include the repetition (`+`), its value is overwritten each time a new `foo` is found. So I lied a little by saying "and if that is true replace it with first one", because it will replace with the last one (but since both values are the same, it doesn't change anything). – Pshemo Mar 06 '15 at 18:47
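The point about a quantified group keeping only its last iteration is easier to see when the repeated characters differ. The following is a minimal sketch that uses digits instead of spaces purely for visibility; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: digits stand in for spaces so the captured value is visible.
class LastIterationDemo {
    // Returns what group 1 holds after (\d)+ has matched a run of digits,
    // or null if the text contains no digit.
    static String lastCaptured(String text) {
        Matcher m = Pattern.compile("(\\d)+").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastCaptured("a123b"));                // 3
        System.out.println("a123b".replaceAll("(\\d)+", "<$1>")); // a<3>b
    }
}
```

The whole run 123 is matched, but group 1 is rewritten on every repetition and ends up holding only the final digit.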
2

With one pattern, no lookaheads, and no word-boundary anchors:

text.replaceAll("\\s(\\s?)\\s*", "$1")

Explanation:

  • match any whitespace sequence with a minimum length of 1 (\s)
  • if the next char is a whitespace ((\s?) captures it) => replace with that whitespace
  • else ((\s?) matches the empty string) => replace with the empty String
  • consume all remaining whitespace after it (\s*), so the rest of the run is dropped
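The bullets above can be checked against whitespace runs of different lengths. This is only a sketch: the class name and the collapse helper are invented for the example, and the inputs are made-up test strings.

```java
// Hypothetical wrapper around the single-pattern replacement.
class SingleSpacePatternDemo {
    static String collapse(String text) {
        // \s     first whitespace of the run (always dropped)
        // (\s?)  second whitespace, if any (kept via $1)
        // \s*    whatever whitespace is left in the run (dropped)
        return text.replaceAll("\\s(\\s?)\\s*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("A B"));         // AB   (run of 1 is removed)
        System.out.println(collapse("A  B"));        // A B  (run of 2 keeps one)
        System.out.println(collapse("A     B"));     // A B  (run of 5 keeps one)
        System.out.println(collapse("T E X T   A")); // TEXT A
        System.out.println(collapse("A TEXT"));      // ATEXT (normally spaced input is glued)
    }
}
```

The last call shows the catch: text that was never letter-spaced loses its word separators, so the replacement should only be applied to input known to use the spaced-out style.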
CoronA
  • In fact, this is exactly what I was looking for. Thank you StefanA. Now I have to figure out when I need to apply the replacement, since the text is sometimes written using that technique and sometimes not (did I mention the users have been creative :-) )... I mean, sometimes you find [A...T.E.X.T] and sometimes [A.TEXT], and if I always make the replacement, in the second case I'll end up with [ATEXT]. (Dots instead of white-spaces, so they can be seen in this comment.) – Alex Gouvêa Vasconcelos Mar 06 '15 at 17:54
1

If the words are separated by multiple spaces, you can use a negative lookahead:

\s(?!\s)


Test

"S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T"
.replaceAll("\\s(?!\\s)", "")
.replaceAll("\\s+", " ");
=> SOME TEXT I WOULD LIKE TO HIGHLIGHT
Pshemo
nu11p01n73R
1

So you can use replaceAll("(.)\\s", "$1")

Example:

String s = "S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T";
s = s.replaceAll("(.)\\s", "$1");
System.out.println(s);

Output: SOME TEXT I WOULD LIKE TO HIGHLIGHT


Explanation:

Think of your text as two-character chunks (I will mark them with ^^ and ##).

S O M E   T E X T
^^##^^##^^##^^##

If you look closely you will notice that you want to remove the second character of each pair (which is a space) and keep the first character:

S O M E   T E X T
^ # ^ # ^ # ^ # T - T will not be affected (will stay) 
                    because it doesn't have space after it.

You can achieve this with the regex (.)\s, where

  • . represents any character except a line terminator (so it also matches a space)
  • \s represents any whitespace

This way the first character is placed in a group (indexed as 1), which lets us reuse what that part matched in the replacement via $x, where x is the group index.
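Because this approach consumes the text strictly in pairs, its result depends on how many spaces separate the words. The sketch below probes that with made-up inputs; the class and method names are hypothetical.

```java
// Hypothetical demo of the pair-based replacement on runs of 1 to 5 spaces.
class PairDemo {
    static String dropSecondOfPair(String text) {
        // Keep the first character of each (any char, whitespace) pair.
        return text.replaceAll("(.)\\s", "$1");
    }

    public static void main(String[] args) {
        System.out.println("[" + dropSecondOfPair("A B") + "]");     // [AB]
        System.out.println("[" + dropSecondOfPair("A  B") + "]");    // [A B]
        System.out.println("[" + dropSecondOfPair("A   B") + "]");   // [A B]
        System.out.println("[" + dropSecondOfPair("A    B") + "]");  // [A  B]
        System.out.println("[" + dropSecondOfPair("A     B") + "]"); // [A  B]
    }
}
```

With two or three spaces between words a single space remains, but with four or five spaces two remain, so this variant fits input that was produced by adding exactly one space after every character.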


Ver.2 (in case spaces to remove are not only on odd indexed positions)

Another way to solve this problem is to remove only those spaces which

  • are placed right after a non-space character: (?<=\\S)\\s

    S O M E       T E X T
     ^ ^ ^ ^       ^ ^ ^
    
  • are placed before other spaces: \\s(?=\\s)

    S O M E       T E X T
     ^ ^ ^ ^#####  ^ ^ ^
    

This way, as you can see, one space is left (the one right before the next word), so your solution can look like

s = s.replaceAll("(?<=\\S)\\s|\\s+(?=\\s)", "");
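As a quick check that this second version does not depend on the length of the gap between words, here is a minimal sketch with invented test inputs; the class name and the collapse helper are hypothetical.

```java
// Hypothetical wrapper around the lookaround-based replacement (Ver.2).
class LookaroundDemo {
    static String collapse(String text) {
        // (?<=\S)\s    a whitespace directly after a non-whitespace character
        // \s+(?=\s)    whitespace that is still followed by more whitespace
        return text.replaceAll("(?<=\\S)\\s|\\s+(?=\\s)", "");
    }

    public static void main(String[] args) {
        System.out.println(collapse("A B"));     // AB
        System.out.println(collapse("A  B"));    // A B
        System.out.println(collapse("A    B"));  // A B
        System.out.println(collapse("A     B")); // A B
        System.out.println(collapse(
            "S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T"));
        // SOME TEXT I WOULD LIKE TO HIGHLIGHT
    }
}
```

Lookarounds are evaluated against the original text, so the last space of each run is never removed: it neither follows a non-space character nor precedes another space.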
Pshemo
  • The expression works only with 2 or 3 spaces. – CoronA Mar 06 '15 at 14:54
  • @StefanA Which is what OP gave us in his question. I suspect that string from OP example was created precisely by adding space after every character in original string, except last one. – Pshemo Mar 06 '15 at 14:57
  • Actually, it's a very clever way to "undo" what the user may have done. But @StepfanA is right... it's possible that this is not the only case we can find in those texts. Thank you anyway, Pshemo. – Alex Gouvêa Vasconcelos Mar 06 '15 at 18:18
0

If the words are guaranteed to be separated by two or more spaces, then:

  • First, remove every single space that sits between two non-space characters:

    input = input.replaceAll("(?<=\\S)\\s(?=\\S)", "");
    
  • Then, replace each run of two or more spaces between the words with just one:

    input = input.replaceAll("\\s{2,}", " ");
    

So, the complete code would look like

String input = "S O M E   T E X T   I   W O U L D   L I K E   T O   H I G H L I G H T";
input = input.replaceAll("(?<=\\S)\\s(?=\\S)", "").replaceAll("\\s{2,}", " ");

System.out.println(input); // SOME TEXT I WOULD LIKE TO HIGHLIGHT
Ravi K Thapliyal