
I am trying to use the Java regex matcher to search and replace. However, after it failed to match a certain string, I noticed that the expression ".*" seems to fail to match certain Unicode characters (in my case it was a \u2028 LINE SEPARATOR character).

This is what I have at the moment (match an XML element with any text in between):

String segSourceSearch = "<source(.?)>(.*?)</source>";
String segSourceReplace = "<source$1>$2</source><target$1>$2</target>";
myString = myString.replaceAll(segSourceSearch, segSourceReplace);

Basically, what this is supposed to do is duplicate the element. But how can I modify the regex (.*?) to match any Unicode character between <source> and </source>? Is there a built-in pattern in Java? If not, is there anything in ICU4J that I could use? (I haven't been able to find a regex matcher in ICU4J).

Alan Moore
marw
  • See http://stackoverflow.com/questions/3651725/match-multiline-text-using-regular-expression – peter.petrov Dec 02 '13 at 11:58
  • Take a look at this similar case: http://stackoverflow.com/questions/10894122/java-regex-for-support-unicode – user2279268 Dec 02 '13 at 11:59
  • Thanks a lot for the hint. But this solution is for matching characters from the Letter class. This would not match non-letter characters, such as the Line Separator. – marw Dec 02 '13 at 12:29
  • @marw, you seem to have a typo in your regex. I'm pretty sure `(.?)` was supposed to be `(.*?)`. – Alan Moore Dec 02 '13 at 14:20

1 Answer


Pattern.DOTALL:

Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s).
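To see the difference for the character from the question, here is a minimal sketch. The class name DotLineTerminatorDemo and the helper dotStarMatches are made up for illustration; only Pattern.compile, Pattern.matches, Matcher.matches and the Pattern.DOTALL flag come from java.util.regex.

```java
import java.util.regex.Pattern;

class DotLineTerminatorDemo {
    // Hypothetical helper: true if the whole input matches ".*" under the given flags.
    static boolean dotStarMatches(String input, int flags) {
        return Pattern.compile(".*", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+2028 LINE SEPARATOR counts as a line terminator for java.util.regex.
        String text = "first\u2028second";

        System.out.println(dotStarMatches(text, 0));              // false: default mode
        System.out.println(dotStarMatches(text, Pattern.DOTALL)); // true: dotall flag
        System.out.println(Pattern.matches("(?s).*", text));      // true: embedded flag
    }
}
```

Passing the flag to Pattern.compile and writing (?s) inside the expression should behave the same here; the embedded form is the one that works with String.replaceAll, which takes no flags argument.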

So the pattern you are looking for is (?s).*?. To capture the match you still have to enclose it in parentheses, ((?s).*?), but you can also place the (?s) at the beginning of the entire expression to enable DOTALL mode for the whole regex.
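Applied to the replacement from the question, this could look roughly like the sketch below. It assumes the first group was meant to be (.*?), as pointed out in the comments, and puts (?s) at the start of the expression. The class name DuplicateSourceDemo, the method duplicateSource and the sample input are hypothetical.

```java
class DuplicateSourceDemo {
    // Hypothetical helper: copies every <source ...> element into a <target ...> sibling.
    static String duplicateSource(String xml) {
        // (?s) turns on dotall mode for the whole expression, so both
        // reluctant groups may span line terminators such as U+2028.
        String search = "(?s)<source(.*?)>(.*?)</source>";
        String replace = "<source$1>$2</source><target$1>$2</target>";
        return xml.replaceAll(search, replace);
    }

    public static void main(String[] args) {
        String input = "<source lang=\"en\">line one\u2028line two</source>";
        String output = duplicateSource(input);

        // The element is duplicated, attributes and LINE SEPARATOR included.
        System.out.println(output.contains("<target lang=\"en\">line one\u2028line two</target>")); // true
    }
}
```

Without the (?s) prefix the same call leaves this input unchanged, because the second group cannot get past the U+2028 character.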

Holger
  • Thanks a lot for the replies. @Alan, you are correct. That was indeed a typo. – marw Dec 05 '13 at 12:13
  • Thanks a lot for the solution. In the meantime, I have gone down a different route. The text that I wanted to process was in XML, and it turned out that my transformation was easier to do using XML functions. Thanks anyway. I will bear the DOTALL pattern in mind for the future. – marw Dec 05 '13 at 12:14