0

I'm trying to find the main stem of an Arabic word. The user will enter لاعبون and the program will try to remove ون from the word. The remaining part of the word will be لاعب, and the program will then try to find the main stem لعب in my list of stems. Can I do that with regex, or is there any other advice? Thanks

  • Regex is applied to words and characters. Make sure the language you are going to use regex in has Unicode/UTF-8 support. After that, it is a regular string and a regular regex. Nothing changes there, because regex is locale-independent. – Acewin Dec 08 '16 at 16:40
  • 2
    Java [Pattern](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) supports a number of Unicode scripts, blocks and categories that you may leverage for this. I think the question's too broad as is. Try to add some code and expected output illustrating what you're trying to do. – Mena Dec 08 '16 at 16:40
  • You should be able to identify patterns using regex the same way as for English, as long as you are using the correct character encoding. – Chetan Jadhav CD Dec 08 '16 at 16:41
  • You might find [this answer](http://stackoverflow.com/a/24028429/515948) useful or, better yet, the [documentation](http://unicode.org/reports/tr18/#Categories) – Laur Ivan Dec 08 '16 at 16:41
  • 2
    @Hani It's a question with no research evidence or problem statement. It's a bad question. – 4castle Dec 08 '16 at 16:47
  • @Hani you shouldn't upvote the question just because somebody else dislikes it. Your votes should be based on your own opinions. The fact that you don't understand why someone downvoted it doesn't make your half-hearted opinion more valid than their genuinely held one. – Dawood ibn Kareem Dec 08 '16 at 16:49
  • This is an extremely broad and complex question, and to answer it correctly requires knowledge of Arabic morphology as well as regular expressions. I don't think you'll get a good answer from the Stack Overflow community. I've voted to close it as "too broad". – Dawood ibn Kareem Dec 08 '16 at 17:05

5 Answers

1

Most regex engines these days, including Java's, support Unicode. For your particular case, you want something like this:

String text = "لاعبون";
// Strings are immutable, so the result of replaceAll must be assigned
text = text.replaceAll("\\u0648\\u0646", "");

Basically, all you need to do is replace every specific Unicode codepoint you want removed with the empty string. Done and done.
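If the suffix should only be dropped at the end of a word, one possible variant is to add a negative lookahead so that the match cannot be followed by another letter or combining mark. This is only a sketch: the class name `SuffixStripper` and the method `stripSuffix` are made up for illustration, and the code points are written as escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class SuffixStripper {
    // waw + noon (U+0648 U+0646), matched only when no letter or
    // combining mark follows, i.e. at the end of a word.
    private static final Pattern WAW_NOON_AT_WORD_END =
            Pattern.compile("\\u0648\\u0646(?![\\p{L}\\p{M}])");

    // Removes the suffix from every word in the text that ends with it.
    public static String stripSuffix(String text) {
        return WAW_NOON_AT_WORD_END.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The word from the question: lam, alif, ain, beh, waw, noon
        String players = "\u0644\u0627\u0639\u0628\u0648\u0646";
        System.out.println(stripSuffix(players));

        // A word with waw + noon in the middle is left untouched
        String legal = "\u0642\u0627\u0646\u0648\u0646\u064A";
        System.out.println(stripSuffix(legal));
    }
}
```

Note that this is still purely mechanical: a word whose stem happens to end in ون would be trimmed as well, so the result is best treated as a candidate to check against the stem list.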

Sebastian Lenartowicz
  • This is an oversimplification. OP has a lot of text to deal with, and he only wants to remove `"ون "` when it occurs at the end of a word. Also, this only attempts to answer half of his question. He wants to extract the stem `"لعب "` from the remainder of the word. For the latter, he can't just remove all alif characters, because some alifs are part of the stem and some are not. – Dawood ibn Kareem Dec 08 '16 at 16:54
0

Do you even need to use the escaped code points? This works:

regex: ون(.*)

replace: $1

Scott Weaver
0

Here is a full example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    public static void main(String[] args) {
        // Capture everything that precedes the suffix ون
        Pattern p = Pattern.compile("(.*)" + "ون");
        Matcher m = p.matcher("لاعبون");
        Matcher m2 = p.matcher("يييي");
        // matches() succeeds only if the whole word ends with the suffix
        System.out.println(m.matches());
        // group(1) is the part of the word before the suffix
        System.out.println(m.group(1));
        System.out.println(m2.matches());
    }
}

This will print:

true
لاعب
false
Hani
0

Since each glyph keeps its character code, there is no big difference compared to English, for example. You should just write patterns that match three-character roots and then write rules to convert them to another pattern/template.
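As a rough sketch of that idea, the pattern below encodes a single template, the active-participle shape فاعل (first root letter, alif, second root letter, third root letter), and rebuilds the three-letter root from the captured groups. Everything here is an assumption made for illustration: the class name `RootTemplate`, the two-entry stem list, and the choice of one template only. Real Arabic morphology needs many more templates and special cases.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RootTemplate {
    // Template C1 + alif (U+0627) + C2 + C3; each C is captured as one
    // character of the Arabic script.
    private static final Pattern FAAIL = Pattern.compile(
            "(\\p{IsArabic})\\u0627(\\p{IsArabic})(\\p{IsArabic})");

    // Hypothetical stem list standing in for the asker's own list.
    private static final Set<String> KNOWN_STEMS = new HashSet<>(Arrays.asList(
            "\u0644\u0639\u0628",   // lam, ain, beh
            "\u0643\u062A\u0628")); // kaf, teh, beh

    // Returns the three captured root letters, or null if the word
    // does not have the shape of this template.
    public static String extractRoot(String word) {
        Matcher m = FAAIL.matcher(word);
        return m.matches() ? m.group(1) + m.group(2) + m.group(3) : null;
    }

    // True only if the candidate is present in the stem list.
    public static boolean isKnownStem(String candidate) {
        return candidate != null && KNOWN_STEMS.contains(candidate);
    }

    public static void main(String[] args) {
        // lam, alif, ain, beh: the word left after removing the suffix
        String root = extractRoot("\u0644\u0627\u0639\u0628");
        System.out.println(root + " known: " + isKnownStem(root));
    }
}
```

Checking the candidate against the stem list matters because the template alone cannot tell whether the alif belongs to the pattern or to the root itself.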

Mehdi
-1

The issue you're describing has a large number of variables. Do you know all the prefixes and suffixes? Can you make a list of them?

If you can do both of the above, that gives you a list that you can then test your word against, removing characters as appropriate.

See a previous answer to a similar question (How to ban words with diacritics using a blacklist array and regex?)

Convert your characters to a character representation in UTF-8 (I believe this will save you some trouble.)

Then use a simple regex.

Let's say (because I can't convert these myself right now) ون = x021-x023

Your word (converted to 16 bit) pushed into the regex and passed through this > s/x021-x023$//

would trim the x021-x023 off the end of your word.

Convert it back into your normal character set.

And you have your trimmed, shortened word.
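In Java specifically, strings are already Unicode (UTF-16) internally, so the list-based approach can probably be sketched without any conversion step. The following is a minimal illustration under stated assumptions: the class name `AffixStemmer`, the short prefix and suffix lists, and the rule of keeping at least three characters are all hypothetical and would need to be replaced with a proper list for real text.

```java
import java.util.Arrays;
import java.util.List;

public class AffixStemmer {
    // Never strip so much that fewer than three characters remain.
    private static final int MIN_REMAINDER = 3;

    // Hypothetical, deliberately incomplete affix lists.
    private static final List<String> PREFIXES = Arrays.asList(
            "\u0627\u0644");  // alif, lam (definite article)
    private static final List<String> SUFFIXES = Arrays.asList(
            "\u0648\u0646",   // waw, noon
            "\u064A\u0646",   // yeh, noon
            "\u0627\u062A");  // alif, teh

    // Removes at most one listed prefix and one listed suffix.
    public static String strip(String word) {
        String withoutPrefix = removeFirstMatch(word, PREFIXES, true);
        return removeFirstMatch(withoutPrefix, SUFFIXES, false);
    }

    private static String removeFirstMatch(String word, List<String> affixes, boolean atStart) {
        for (String affix : affixes) {
            boolean present = atStart ? word.startsWith(affix) : word.endsWith(affix);
            if (present && word.length() - affix.length() >= MIN_REMAINDER) {
                return atStart
                        ? word.substring(affix.length())
                        : word.substring(0, word.length() - affix.length());
            }
        }
        return word; // nothing on the list applies
    }

    public static void main(String[] args) {
        // The word from the question: lam, alif, ain, beh, waw, noon
        System.out.println(strip("\u0644\u0627\u0639\u0628\u0648\u0646"));
    }
}
```

Plain `startsWith` and `endsWith` are used here instead of a regex because the affixes are fixed strings, which keeps the list easy to extend.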

Community
TolMera