
I have a set of regex replacements that need to be applied to a set of Strings.

For example:

  1. replace every run of multiple whitespace characters with a single space ("\s{2,}" --> " ")
  2. replace every . followed by a letter with . followed by a space followed by that letter ("\.([a-zA-Z])" --> ". $1")

So I will have something like this:

String s="hello     .how are you?";
s=s.replaceAll("\\s{2,}"," ");
s=s.replaceAll("\\.([a-zA-Z])",". $1");
....

It works, but imagine applying 100+ such expressions to a long String. Each replaceAll call is a separate pass over the whole String, so this gets slow.

So my question is: is there a more efficient way to combine these replacements into a single replaceAll (or something similar, e.g. Pattern/Matcher)?

I have followed Java Replacing multiple different..., but the problem is that my patterns are regexes, not simple Strings.

nafas
  • You can use a single big regex and [`Matcher.appendReplacement`](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-). You'll have to be very careful with your regex, however, as it may get somewhat messy and possibly suffer from catastrophic backtracking. – Boris the Spider Dec 09 '15 at 14:08
  • @BoristheSpider if I use this then I have the problem of knowing which regex has been used. – nafas Dec 09 '15 at 14:10
  • Nope, simply use capturing groups and check which one has data in it. – Boris the Spider Dec 09 '15 at 14:10
  • @BoristheSpider let's say I matched `.A` how would I know if this was matched using `\\.([a-zA-Z])` – nafas Dec 09 '15 at 14:13
  • If you have a pattern, for example `(A)|(B)` then you know, when you get a match, either group 1 or group 2 will be filled - the other will be empty (barring [this bug](https://stackoverflow.com/questions/22557708/regex-possesive-quantifier)). You can use that to determine the replacement. – Boris the Spider Dec 09 '15 at 14:18
  • I feel this is turning into an `xy problem` – nafas Dec 09 '15 at 14:35
  • If any answer actually answers your question then you should accept it. – Aseem Bansal Feb 10 '16 at 04:21
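The group check described in the comments above can be sketched as follows. This is a minimal illustration, not code from the thread: the class name `WhichRule` and the method `firstMatchingRule` are made up for the example, and the two rules are the ones from the question, each wrapped in its own capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhichRule {
    // Group 1 wraps the whitespace rule, group 2 wraps the dot-plus-letter rule.
    private static final Pattern RULES = Pattern.compile("(\\s{2,})|(\\.[a-zA-Z])");

    // Returns the number of the rule that produced the first match, or 0 if nothing matched.
    public static int firstMatchingRule(String input) {
        Matcher m = RULES.matcher(input);
        if (!m.find()) {
            return 0;
        }
        // Only the alternative that actually matched has a non-null group.
        return m.group(1) != null ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchingRule(".A"));    // 2
        System.out.println(firstMatchingRule("a   b")); // 1
    }
}
```

So a match on `.A` reports rule 2, which is the information needed to pick the right replacement.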

2 Answers


You have these 2 replaceAll calls:

s = s.replaceAll("\\s{2,}"," ");
s = s.replaceAll("\\.([a-zA-Z])",". $1");

You can combine them into a single replaceAll like this:

s = s.replaceAll("\\s{2,}|(\\.)(?=[a-zA-Z])", "$1 ");

RegEx Demo
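A quick way to convince yourself that the combined call behaves like the two separate calls is to run both on the sample String from the question. The sketch below does that; the class and method names (`CombinedCheck`, `sequential`, `combined`) are invented for the illustration. It relies on the fact that in Java a `$1` reference to a group that did not take part in the match expands to nothing.

```java
public class CombinedCheck {
    // The two original calls, applied one after the other.
    public static String sequential(String s) {
        s = s.replaceAll("\\s{2,}", " ");
        return s.replaceAll("\\.([a-zA-Z])", ". $1");
    }

    // The single combined call.
    public static String combined(String s) {
        // Whitespace alternative: group 1 did not participate, so "$1 " becomes " ".
        // Dot alternative: group 1 is ".", so "$1 " becomes ". " (the lookahead leaves the letter in place).
        return s.replaceAll("\\s{2,}|(\\.)(?=[a-zA-Z])", "$1 ");
    }

    public static void main(String[] args) {
        String s = "hello     .how are you?";
        System.out.println(sequential(s)); // hello . how are you?
        System.out.println(combined(s));   // hello . how are you?
    }
}
```

This only checks the sample input. Rules whose output is meant to be picked up by a later rule may not merge this cleanly, because a single pass does not rescan text it has already replaced.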

anubhava
  • This is a great observation mate, but unfortunately there are many other rules that can't fit into such a technique – nafas Dec 09 '15 at 15:06
  • I posted answer based on the code you have in question. If you show more code then I can better judge what can be done to optimize it. – anubhava Dec 09 '15 at 15:08
  • thx mate, there are over 100 rules, obviously pointless to add them all, there is this one as well `([a-zA-Z])\\-,([a-zA-Z])` --> `$1-$2` – nafas Dec 09 '15 at 15:27
  • @nafas This is the basis of a good solution, even if your expressions are "complex". If you can group all your regexes based on common replacement expressions, then chain calls to `replaceAll()` using regex alternation (as in this example), it will be as efficient as you can get it, e.g. `s = s.replaceAll("\\s{2,}|(\\.)(?=[a-zA-Z])", "$1 ").replaceAll("foo|bar|baz", "qux").replaceAll...;` – Bohemian Dec 09 '15 at 15:29
  • @Bohemian I agree, to be honest it took me by surprise when anubhava managed to combine the two. This way I can reduce the number of regexes I'm using, but I still have to have many replaceAll calls. I guess there is no hope for a one-liner or anything more sophisticated. – nafas Dec 09 '15 at 15:38
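The grouping idea from the comments can be sketched like this. `String.replaceAll` is documented as equivalent to `Pattern.compile(regex).matcher(str).replaceAll(repl)`, so it compiles the regex on every call. Precompiling one alternation per distinct replacement avoids that when the same rules run over many Strings. The class `GroupedRules`, its nested `Rule` type and the `foo|bar|baz` rule (taken from the comment above) are illustrative assumptions, not the asker's real rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class GroupedRules {
    // One precompiled alternation per distinct replacement string.
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private static final List<Rule> RULES = new ArrayList<>();
    static {
        RULES.add(new Rule("\\s{2,}|(\\.)(?=[a-zA-Z])", "$1 "));
        RULES.add(new Rule("foo|bar|baz", "qux"));
    }

    // Applies the rule groups in list order; each group is one pass over the String.
    public static String apply(String s) {
        for (Rule rule : RULES) {
            s = rule.pattern.matcher(s).replaceAll(rule.replacement);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(apply("foo     .bar")); // qux . qux
    }
}
```

The number of passes then equals the number of distinct replacement expressions rather than the number of rules.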

Look at Replace multiple substrings at Once and modify it.

Use a Map<Integer, Function<Matcher, String>>.

  • group numbers as Integer keys
  • Lambdas as values

Modify the loop to check which group was matched. Then use that group number for getting the replacement lambda.

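Example

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Outer group number -> function that builds the replacement for that rule
Map<Integer, Function<Matcher, String>> replacements = new HashMap<>();
replacements.put(1, matcher -> " ");                      // run of whitespace -> single space
replacements.put(2, matcher -> ". " + matcher.group(3));  // ".x" -> ". x" (group 3 is the letter)

String input = "hello     .how are you?";

// Join the rules with '|', wrapping each one in its own group so we can tell which one matched
String regexp = "(\\s{2,})|(\\.([a-zA-Z]))";

StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(input);

while (m.find()) {
    // Find the rule whose outer group took part in this match
    for (Map.Entry<Integer, Function<Matcher, String>> rule : replacements.entrySet()) {
        if (m.group(rule.getKey()) != null) {
            // quoteReplacement keeps any '$' or '\' in the computed text literal
            m.appendReplacement(sb, Matcher.quoteReplacement(rule.getValue().apply(m)));
            break;
        }
    }
}
m.appendTail(sb);

System.out.println(sb.toString());   // hello . how are you?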
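With 100+ rules, numbering the groups by hand becomes error-prone, because every capturing group inside a rule shifts the numbers of all later rules. One possible way around that, sketched below, is to let the code compute each rule's outer group number while it builds the alternation. Everything here (`RuleSet`, `Replacer`, `add`, `apply`) is a hypothetical helper written for this illustration. It assumes the rules contain no numbered backreferences such as `\1`, since those would be renumbered by the wrapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleSet {
    /** Builds the replacement text; base is the rule's outer group, so its own group k is base + k. */
    public interface Replacer {
        String apply(Matcher m, int base);
    }

    private final List<Replacer> replacers = new ArrayList<>();
    private final List<Integer> bases = new ArrayList<>();
    private final StringBuilder alternation = new StringBuilder();
    private int nextGroup = 1;
    private Pattern combined; // compiled lazily, reset whenever a rule is added

    public RuleSet add(String regex, Replacer replacer) {
        if (alternation.length() > 0) {
            alternation.append('|');
        }
        alternation.append('(').append(regex).append(')');
        bases.add(nextGroup);
        replacers.add(replacer);
        // Skip the wrapper group plus every capturing group inside the rule itself.
        nextGroup += 1 + Pattern.compile(regex).matcher("").groupCount();
        combined = null;
        return this;
    }

    public String apply(String input) {
        if (combined == null) {
            combined = Pattern.compile(alternation.toString());
        }
        Matcher m = combined.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Exactly one rule's outer group is non-null for each match.
            for (int i = 0; i < bases.size(); i++) {
                int base = bases.get(i);
                if (m.group(base) != null) {
                    String text = replacers.get(i).apply(m, base);
                    m.appendReplacement(sb, Matcher.quoteReplacement(text));
                    break;
                }
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        RuleSet rules = new RuleSet()
            .add("\\s{2,}", (m, g) -> " ")
            .add("\\.([a-zA-Z])", (m, g) -> ". " + m.group(g + 1))
            .add("([a-zA-Z])\\-,([a-zA-Z])", (m, g) -> m.group(g + 1) + "-" + m.group(g + 2));
        System.out.println(rules.apply("hello     .how a-,b")); // hello . how a-b
    }
}
```

The whole input is scanned once regardless of the number of rules. The warning from the comments about messy alternations and catastrophic backtracking still applies to the combined pattern.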
Aseem Bansal