multiple regular expressions vs search algorithm

Question

I have a text file where every line is a random combination of any of the following groups

Numbers - English Letters - Arabic Letters - Punctuation

\w which is composed of a-zA-Z0-9_ for the first 2 groups

\p{InArabic} for the third group

\p{Punct} which is composed of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for the fourth group

I got this info from here

I read each line. The ONLY time I do something to a line is if it contains Arabic letters AND (English letters OR Unicode Symbols)

After reading this post and this post I came up with the following expression. Obviously it's wrong as my output is all wrong >.<

pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");

Here's the input

1
1a
a!
aش
شa
ششa
aشش
شaش
aشa
!aش

The first three shouldn't be matched but my output shows that NONE are a match.

Edit: sorry, I just realized that I forgot to change my title. If any of you feel that a search algorithm would perform better, please suggest one. Using a search algorithm instead of a regex looks ugly, but I'd go with it if it performed better. Thanks to the posts I read, I learned that I can make the regex faster by putting the following in the constructor, so that it is executed only once instead of on every iteration of my loop:

pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
matcher = pattern.matcher("");
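The compile-once idea described in the edit can be sketched as below. This is only an illustration: the class name `LineFilter` and the method `accepts` are hypothetical, and the pattern is passed in rather than fixed, since the expression shown above is the one the question reports as not working. It relies on `Matcher.reset(CharSequence)`, which rebinds an existing matcher to a new input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compile the pattern once, reuse one Matcher per line.
class LineFilter {
    private final Matcher matcher;

    LineFilter(Pattern pattern) {
        // Created once, against an empty input; rebound per line below.
        this.matcher = pattern.matcher("");
    }

    // True if the pattern is found anywhere in the line.
    boolean accepts(String line) {
        return matcher.reset(line).find();
    }

    public static void main(String[] args) {
        LineFilter filter = new LineFilter(Pattern.compile("[a-z]"));
        for (String line : new String[] {"abc", "123", "1a"}) {
            System.out.println(line + " -> " + filter.accepts(line));
        }
    }
}
```

A `Matcher` is not safe for concurrent use, so a filter like this would need to be confined to one thread.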

You can either construct one regular expression that is essentially `({english}+.*{arabic}+)|({arabic}+.*{english}+)`, or you could construct two patterns, one for arabic, one for english, and just see if they both match. The latter might be a little clearer. Alternatively you could ditch the regular expressions and just directly search for an arabic and an english character in the same string. — Jason C, Feb 23 '14 at 19:41
You did it using OR to cover both possibilities. Thanks, but I'm afraid the issue is that I don't fully understand how to write a proper expression, hence my post. As for your suggested alternative, how do I do that? I'd still need a way to see whether ANY Arabic AND English letters are in that string. Which algorithm do you suggest? The direct way is nested loops. Isn't that bad compared to regex? — user3340667, Feb 23 '14 at 19:54
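For the "search directly" alternative raised in these comments, nested loops are not required: one pass with two flags is enough. The following is a minimal sketch, not something posted in the thread. The names `MixedScriptScan` and `shouldProcess` are made up for illustration, and it assumes that "Arabic" means the Unicode Arabic block (the same range as `\p{InArabic}`) and that punctuation means the ASCII set listed for `\p{Punct}`.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical single-pass check, no regular expressions involved.
class MixedScriptScan {
    // Assumption: the ASCII punctuation set documented for \p{Punct}.
    private static final String PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    // True if the line holds an Arabic-block character AND
    // an ASCII letter or ASCII punctuation character.
    static boolean shouldProcess(String line) {
        boolean sawArabic = false;
        boolean sawLatinOrPunct = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (UnicodeBlock.of(c) == UnicodeBlock.ARABIC) {
                sawArabic = true;
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || PUNCT.indexOf(c) >= 0) {
                sawLatinOrPunct = true;
            }
            if (sawArabic && sawLatinOrPunct) {
                return true; // both kinds seen, stop early
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // \u0634 is the Arabic letter sheen used in the sample input.
        String[] samples = {"1", "1a", "a!", "a\u0634", "\u0634a", "!a\u0634"};
        for (String s : samples) {
            System.out.println(s + " -> " + shouldProcess(s));
        }
    }
}
```

Each character is inspected once, so the cost grows linearly with the line length. Whether this beats a precompiled regex in practice would need measuring on real data.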

Casimir et Hippolyte · Answer 1 · 2014-02-23T20:21:38.503

0

To follow your idea, the correct pattern is:

pattern = Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");

A single position in a string cannot be followed by an Arabic letter and, at the same time, by a Latin letter or a punctuation character. In other words, you have written a condition that is always false. Adding .* to each lookahead allows the characters to be anywhere in the string.
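To see the difference, the question's pattern and the corrected one can be run side by side. This is a small sketch with made-up names (`LookaheadDemo`, `found`). It uses `find()` rather than `matches()`: both patterns consist only of zero-width lookaheads, so they consume no characters and `matches()` could not succeed on a non-empty line.

```java
import java.util.regex.Pattern;

// Hypothetical demo comparing the two lookahead patterns from the thread.
class LookaheadDemo {
    // Pattern from the question: both lookaheads inspect the SAME next character.
    static final Pattern ORIGINAL =
            Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");

    // Corrected pattern: .* lets each lookahead scan ahead independently.
    static final Pattern CORRECTED =
            Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");

    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        // \u0634 is the Arabic letter sheen from the sample input.
        String[] samples = {"1", "1a", "a!", "a\u0634", "\u0634a", "a\u0634a"};
        for (String s : samples) {
            System.out.println(s + " original=" + found(ORIGINAL, s)
                    + " corrected=" + found(CORRECTED, s));
        }
    }
}
```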

If you want a more optimised pattern, you can use Jason C's idea, but with negated character classes to reduce the backtracking:

pattern = Pattern.compile("\\p{InArabic}[^a-zA-Z\\p{Punct}]*[a-zA-Z\\p{Punct}]|[a-zA-Z\\p{Punct}]\\P{InArabic}*\\p{InArabic}");
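A quick way to check the alternation against the sample input is sketched below. The names `AlternationDemo` and `isMixed` are illustrative only, and the block name is written with the `InArabic` spelling used in the question. Each branch consumes characters, so `find()` succeeds as soon as an Arabic character and a Latin letter or punctuation character occur in either order.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the two-branch pattern with negated classes.
class AlternationDemo {
    static final Pattern MIXED = Pattern.compile(
            // Arabic first, then the first Latin letter or punctuation after it
            "\\p{InArabic}[^a-zA-Z\\p{Punct}]*[a-zA-Z\\p{Punct}]"
            // or Latin letter/punctuation first, then the first Arabic character
            + "|[a-zA-Z\\p{Punct}]\\P{InArabic}*\\p{InArabic}");

    static boolean isMixed(String line) {
        return MIXED.matcher(line).find();
    }

    public static void main(String[] args) {
        // \u0634 is the Arabic letter sheen from the sample input.
        String[] samples = {"1", "1a", "a!", "a\u0634", "\u0634\u0634a", "!a\u0634"};
        for (String s : samples) {
            System.out.println(s + " -> " + isMixed(s));
        }
    }
}
```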

edited Feb 23 '14 at 20:21

answered Feb 23 '14 at 19:51


Thank you. This works on the sample input, but I think I made a mistake. (?=regex) is a lookahead, right? So if I were to read your regex: first look for any Arabic character in the string. This character can be located alone or at the end of a word. Once you find it, look ahead for any English letter or punctuation. These can be found alone or at the end of a word. Please tell me whether I'm right, because there might be a slight misunderstanding. – user3340667 Feb 23 '14 at 20:07
@user3340667: the characters can be everywhere. – Casimir et Hippolyte Feb 23 '14 at 20:15
@Jason C ah yes your edited comment makes more sense now. Thank you both. – user3340667 Feb 23 '14 at 20:20

score 0 · Answer 2 · answered Feb 24 '14 at 00:06

If you want to find a line with a mix, all you really need are two boundary-condition checks.
A successful match indicates a mix.

   #   "\\p{InArabic}(?=[\\w\\p{Punct}])|(?<=[\\w\\p{Punct}])\\p{InArabic}"

   \p{InArabic} 
   (?= [\w\p{Punct}] )
|  
   (?<= [\w\p{Punct}] )
   \p{InArabic}
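The expression above can be tried out as follows. This is a sketch with made-up names (`AdjacencyDemo`, `hasAdjacentMix`). Two properties follow from how the pattern is written: it only fires when an Arabic character sits directly next to a `\w` or punctuation character, and because `\w` includes digits and the underscore, a digit next to an Arabic character also counts.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the boundary-based pattern.
class AdjacencyDemo {
    static final Pattern ADJACENT = Pattern.compile(
            // Arabic character immediately followed by \w or punctuation
            "\\p{InArabic}(?=[\\w\\p{Punct}])"
            // or Arabic character immediately preceded by \w or punctuation
            + "|(?<=[\\w\\p{Punct}])\\p{InArabic}");

    static boolean hasAdjacentMix(String line) {
        return ADJACENT.matcher(line).find();
    }

    public static void main(String[] args) {
        // \u0634 is the Arabic letter sheen from the sample input.
        String[] samples = {"1", "a!", "a\u0634", "\u0634a", "a \u0634", "1\u0634"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + hasAdjacentMix(s));
        }
    }
}
```

If lines can contain spaces between the Arabic and Latin parts, or if digits should not count, this pattern would need adjusting for that input.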

There are specific conditions. If a line begins with a number and the line includes Arabic, then do something. If a line contains Arabic and English, then do something. If a line ends with punctuation and contains Arabic, then do something. Finally, if a line contains Arabic and Unicode characters, do something. I wrote the following regex. It doesn't check for Unicode. What do you think? "(^[0-9]+.*\\p{InArabic})|(\\p{InArabic}.*\\p{Punct}$)|(?=.*\\p{InArabic})(?=.*[a-zA-Z])" – user3340667 Feb 24 '14 at 22:11
