1

Any simple Unicode string such as زسس or یسیتنانت matches in C# regex using the following pattern, but the same strings don't match in Java.

Can anyone explain this? How do I correct it for it to work in Java?

 "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b"

C# code (it matches the strings):

    private static readonly Regex s_regexEngine;

    private static readonly string s_wordPattern = @"\b[\w\p{M}\u200B\u200C\u00AC\u001F\u200D\u200E\u200F]+\b";

    static PersianWordTokenizer()
    {
        s_regexEngine = new Regex(s_wordPattern, RegexOptions.Multiline);
    }

    public static List<string> Tokenize(string text, bool removeSeparators, bool standardized)
    {
        List<string> tokens = new List<string>();

        int strIndex = 0;
        foreach (Match match in s_regexEngine.Matches(text))
        {
            // execution enters this block
        }

Java code (it doesn't match the strings):

    private static final String s_wordPattern = "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

    static
    {
        s_regexpattern = Pattern.compile(Pattern.quote(s_wordPattern));
    }

    public static java.util.ArrayList<String> Tokenize(String text, boolean removeSeparators, boolean standardized)
    {
        java.util.ArrayList<String> tokens = new java.util.ArrayList<String>();

        int strIndex = 0;
        s_regexEngine = s_regexpattern.matcher(text);
        while (s_regexEngine.find())
        {
            // execution never enters this block
        }
Navid
  • @VladL [Are you sure about that?](http://stackoverflow.com/questions/538579/are-java-and-c-sharp-regular-expressions-compatible) :) – Pshemo Jan 13 '13 at 23:00
  • Yes, you can test it, if possible. – Navid Jan 13 '13 at 23:18
  • @Pshemo Well, not any more :) I think the OP's problem is how he delivers text to the regex function. In .NET, Unicode is enabled by default; I don't know how it is in Java. – VladL Jan 14 '13 at 00:01

3 Answers

0

The regular expression itself does not change between .NET and Java, so here is roughly how you would use it in Java.

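package regexdemo;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String term = "Hello-World";
        boolean found = false;
        // Compile the pattern string directly; backslashes are doubled for the Java string literal.
        Pattern p = Pattern.compile("\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b");
        Matcher m = p.matcher(term);
        if (m.find()) {
            found = true;
        }
        System.out.println(found);
    }
}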

Also, as a starting point for telling the different regex flavors apart, I'd recommend looking at these sites:

http://docs.oracle.com/javase/tutorial/essential/regex/index.html
http://www.regular-expressions.info/

Alan Moore
Terrance
0

Look at the "any letter" Unicode character class, \p{L}, or at the Pattern.UNICODE_CHARACTER_CLASS flag that can be passed to Java's Pattern.compile method.

I guess the second one, being Java-only, won't interest you, but it is worth mentioning.

import java.util.regex.Pattern;

/**
 * @author Luc
 */
public class Test {

  /**
   * @param args
   */
  public static void main(final String[] args) {

    test("Bonjour");

    test("یسیتنانت");

    test("世界人权宣言 ");
  }

  private static void test(final String text) {

    showMatch(Pattern.compile("\\b\\p{L}+\\b"), text);

    showMatch(Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), text);
  }

  private static void showMatch(final Pattern pattern, final String text) {

    System.out.println("With pattern \"" + pattern + "\": " + text + " " + pattern.matcher(text).find());
  }

}

Results:

With pattern "\b\p{L}+\b": Bonjour true
With pattern "\b\w+\b": Bonjour true
With pattern "\b\p{L}+\b": یسیتنانت true
With pattern "\b\w+\b": یسیتنانت true
With pattern "\b\p{L}+\b": 世界人权宣言  true
With pattern "\b\w+\b": 世界人权宣言  true
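
To tie this back to the pattern in the question, here is a minimal sketch that compiles the original pattern string directly (no Pattern.quote), once with default flags and once with Pattern.UNICODE_CHARACTER_CLASS. The class name TokenizerSketch and the tokenize helper are made up for illustration, and the sample text is just two words built from the letters zain and seen, written as escapes. I have not tried it against the zero-width characters listed in the character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // The pattern string from the question, unchanged.
    static final String WORD_PATTERN =
        "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

    // Default flags: \w only covers ASCII letters, digits and underscore.
    static final Pattern ASCII_WORDS = Pattern.compile(WORD_PATTERN);

    // UNICODE_CHARACTER_CLASS makes \w and \b Unicode-aware (Java 7+).
    static final Pattern UNICODE_WORDS =
        Pattern.compile(WORD_PATTERN, Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every match of the given pattern in the text.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two space-separated words made of the letters zain and seen.
        String text = "\u0632\u0633\u0633 \u0632\u0633\u0633";
        System.out.println("default flags: " + tokenize(ASCII_WORDS, text).size() + " tokens");
        System.out.println("unicode flag:  " + tokenize(UNICODE_WORDS, text).size() + " tokens");
    }
}
```

With default flags none of the letters fall inside the character class, so the first call should find nothing, while the flagged pattern should return both words.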
Sxilderik
-4

Wrap the regex string in a call to java.util.regex.Pattern.quote. e.g., java.util.regex.Pattern.quote(yourCSharpRegexString).
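
One thing to check before relying on this: Pattern.quote is documented as returning a literal pattern for the given string, so metacharacters such as \b and \w lose their special meaning once the string is quoted. A small sketch of that behaviour follows; QuoteSketch and its helper names are hypothetical, made up only for this illustration.

```java
import java.util.regex.Pattern;

class QuoteSketch {
    // Compiles the string as a regular expression.
    static boolean findsAsRegex(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Compiles the string after Pattern.quote, i.e. as literal text.
    static boolean findsAsLiteral(String regex, String text) {
        return Pattern.compile(Pattern.quote(regex)).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "\\b\\w+\\b";
        // Interpreted as a regex, this finds the word "Hello".
        System.out.println(findsAsRegex(regex, "Hello"));
        // Quoted, the same string is searched for character by character.
        System.out.println(findsAsLiteral(regex, "Hello"));
        // Text that contains the characters of the regex itself.
        System.out.println(findsAsLiteral(regex, "pattern: \\b\\w+\\b"));
    }
}
```

The quoted pattern only matches text that literally contains the regex's own characters, which would also be consistent with the Java code in the question never entering its loop.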

Dave Doknjas