Regex with Arabic expressions

Question

How do I make JavaScript ignore Arabic diacritics (اعراب) in a regex? For example, I want و and ؤ to be treated as equal, and ا آ اَ اِ to all be treated as equal, and so on. Please help. Thanks a lot.

Salam :) You may need to add the solutions you have tried so far. You can't just ask people to do the work for you; we are here to help you resolve issues in your code. — Neji Soltani, Aug 08 '18 at 14:34
If you want to ignore it, what do you want to not ignore? If you want those symbols to be treated equally, in what context? What are you trying to achieve? Note that I sadly can't read Arabic, so for me those are just some symbols without any meaning. — ASDFGerte, Aug 08 '18 at 14:39
Could you give the Unicode character names and/or the Unicode code points for those of us who don't read Arabic? — JGNI, Aug 08 '18 at 14:40
@JGNI copy them as string into the browser console, and use `String.prototype.codePointAt` (or in this case probably even `String.prototype.charCodeAt`). Google should also give the names of the symbols fairly quick, e.g. [Waw](https://en.wikipedia.org/wiki/Waw_(letter)) — ASDFGerte, Aug 08 '18 at 14:44
These two links may be helpful: https://stackoverflow.com/questions/5224267/javascriptremove-arabic-text-diacritic-dynamically and https://stackoverflow.com/questions/36185493/javascript-regex-to-match-string-contain-arabic-special-characters-symbols-%D9%80-u — ibrahim mahrir, Aug 08 '18 at 15:05

score 4 · Answer 1 · Neji Soltani · 2018-08-08T15:15:47.580

The solution is to convert each accented letter to its unaccented equivalent, which makes it easier to check whether two letters are equal.

Here's some simple code that identifies accented characters and replaces each one with its base letter.

var noAccentOrigin = {
  'ك': 'ک',
  'ﻷ': 'لا',
  'ؤ': 'و',
  'ى': 'ی',
  'ي': 'ی',
  'ئ': 'ی',
  'أ': 'ا',
  'إ': 'ا',
  'آ': 'ا',
  'ٱ': 'ا',
  'ٳ': 'ا',
  'ة': 'ه',
  'ء': '',
  'ِ': '',
  'ْ': '',
  'ُ': '',
  'َ': '',
  'ّ': '',
  'ٍ': '',
  'ً': '',
  'ٌ': '',
  'ٓ': '',
  'ٰ': '',
  'ٔ': '',
  '�': ''
}

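// Look at every character outside the range \u0000-\u007E and swap in its mapped base form, if any.
var accentRemover = function(str) {
  return str.replace(/[^\u0000-\u007E]/g, function(a) {
    // Own-property check: unmapped characters pass through unchanged.
    return Object.prototype.hasOwnProperty.call(noAccentOrigin, a) ? noAccentOrigin[a] : a;
  });
}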
var stringToTest = 'ا آ اَ اِ'
console.log('Original string: ' + stringToTest)
console.log('Converted string: ' + accentRemover(stringToTest))

// Test example
console.log('Are ؤ and و equal? ')
console.log(accentRemover('ؤ') == accentRemover('و'))

Hope that helps

edited Aug 08 '18 at 15:15

answered Aug 08 '18 at 14:58

There isn't a pattern to the Unicode numbers, is there? Like how you can subtract 32 from a to capitalize it? Or some way to split it into the actual constituent parts and strip out accents? – Marie Aug 08 '18 at 15:00
Hi @Marie, that's a good idea! Let me do some research on that first. – Neji Soltani Aug 08 '18 at 15:14
@Marie I don't think so. The thing is there isn't only two characters (`a` and `A`). There is more: all of `أ` and `إ` and `اَ` and `اً` and `اُ` and `اٌ` and `اْ` and `اِ` and many more have the same origin/root which is `ا`. Depending on the letter there could be less or more. – ibrahim mahrir Aug 08 '18 at 15:15
@NejiSoltani You could replace all the accents in one regex like so: `.replace(/[ِِ َ ً ُ ٌ ْ ٍ ]/g, "")` – ibrahim mahrir Aug 08 '18 at 15:19
@ibrahimmahrir Some symbols have dedicated Unicode code points in addition to the symbol + joiner + accent. Does JavaScript account for that if you replace JUST the accent? – Marie Aug 08 '18 at 15:23
Regarding the different accents, if they are uniformly offset you could use modulo and add or subtract to get the base symbol. – Marie Aug 08 '18 at 15:24
I'm working on that solution; it's pretty difficult for Arabic letters. – Neji Soltani Aug 08 '18 at 15:37
Is there a canonical decomposition form of the characters that will give the base character and the accents as separate code points? If so, you could use a regex to remove every character with the Unicode property `Mark`, leaving just the base characters to be compared. – JGNI Aug 09 '18 at 07:08
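
The decomposition idea in the last comment can be sketched roughly as follows. This is only an illustration, not code from the thread: `stripMarks` is a hypothetical helper name, and it assumes a runtime that supports `String.prototype.normalize` and Unicode property escapes (`\p{M}` with the `u` flag). Code points are written as escapes to keep the combining marks visible.

```javascript
// Hypothetical helper: decompose canonically, then drop every combining mark.
function stripMarks(str) {
  return str
    .normalize('NFD')         // e.g. U+0624 becomes U+0648 (waw) + U+0654 (hamza above)
    .replace(/\p{M}/gu, '');  // remove all code points with the Unicode Mark property
}

console.log(stripMarks('\u0624') === '\u0648');       // waw with hamza vs. plain waw: true
console.log(stripMarks('\u0627\u064E') === '\u0627'); // alef + fatha vs. plain alef: true
console.log(stripMarks('\u0622') === '\u0627');       // alef with madda vs. plain alef: true
```

Letters that have no canonical decomposition (for example U+0671 alef wasla, or mapping ة to ه) are not affected by this approach, so a small lookup table like the one in the answer above would still be needed for those.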

score 0 · Answer 2 · answered Aug 09 '18 at 07:17

I did some thinking about this, and what you really need is the Unicode collation algorithm at the base level. This question gives a good run-down of the problem. Looking at the answers, I'd suggest using the `String.prototype.localeCompare()` function with the `sensitivity` option set to `base`.
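
A minimal sketch of that suggestion, assuming a runtime with `Intl` collation support. The helper name `sameBaseLetters` is made up for illustration. Whether a given pair of Arabic letters compares as equal depends on the collation data shipped with the runtime, so the Arabic line below deliberately states no expected result.

```javascript
// Hypothetical helper: true when the collator sees no base-level difference.
function sameBaseLetters(a, b, locale) {
  // sensitivity: 'base' ignores accent and case differences during comparison.
  return a.localeCompare(b, locale, { sensitivity: 'base' }) === 0;
}

console.log(sameBaseLetters('a', '\u00E1', 'en')); // a vs. a-acute: true
console.log(sameBaseLetters('a', 'b', 'en'));      // different base letters: false

// Plain waw vs. waw with hamza; check the result on your own runtime.
console.log(sameBaseLetters('\u0648', '\u0624', 'ar'));
```

Note that `localeCompare` compares whole strings, so on its own it answers "are these two strings equivalent?" rather than "where does this term occur in a longer text?".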

score 0 · Answer 3 · answered Aug 09 '18 at 07:36

Awesome, but what I want is that when I search for "و", JavaScript also includes "ؤ" in the results. That means the function should not remove the accent but rather add it. Thanks a lot.
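
One way to approach the search described here is to expand the search term into a pattern instead of altering the text. The sketch below is an assumption-laden illustration, not a complete solution: `VARIANTS` and `buildLoosePattern` are hypothetical names, and the variant groups cover only waw and alef and would need extending for other letters.

```javascript
// Assumed variant groups (illustrative, not exhaustive): base letter -> accepted forms.
const VARIANTS = {
  '\u0648': '\u0648\u0624',                   // waw, waw with hamza
  '\u0627': '\u0627\u0623\u0625\u0622\u0671', // alef, hamza above/below, madda, wasla
};
const OPTIONAL_MARKS = '\\p{M}*'; // any number of combining marks after each letter

// Hypothetical helper: turn a plain search term into a diacritic-tolerant regex.
function buildLoosePattern(term) {
  const body = Array.from(term, (ch) => {
    const variants = VARIANTS[ch];
    const literal = ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // escape regex syntax
    return (variants ? '[' + variants + ']' : literal) + OPTIONAL_MARKS;
  }).join('');
  return new RegExp(body, 'gu');
}

const text = '\u0633\u0624\u0627\u0644';  // word spelled with waw-hamza
const query = '\u0633\u0648\u0627\u0644'; // same word typed with plain waw
console.log(text.match(buildLoosePattern(query)) !== null); // true
```

Because the text itself is left untouched, match positions still refer to the original string, which helps if the matches need to be highlighted.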
