1

Can anyone tell me a regular expression for matching Arabic characters in Ruby?

Yu Hao
Sivananda

2 Answers

6

You can use a \p{} Unicode script property (available in Ruby 1.9 and later):

/\p{Arabic}/

Example:

"مرحبا بالعالم".scan(/\p{Arabic}+/)
# ["\u0645\u0631\u062D\u0628\u0627", "\u0628\u0627\u0644\u0639\u0627\u0644\u0645"]
Yu Hao
  • /\p{Arabic}/ is not working in Ruby 1.8.7, which I am using in my project. Any ideas for Ruby 1.8.7? – Sivananda Jan 12 '15 at 07:27
  • 1
    @Sivananda Probably not what you want to hear, but, update your Ruby version? – Yu Hao Jan 12 '15 at 12:00
  • @Sivananda Ruby 1.8.7 was [retired](https://www.ruby-lang.org/en/news/2013/06/30/we-retire-1-8-7/) over a year and a half ago. – Mark Thomas Jan 12 '15 at 12:17
  • @Yu Hao & Mark Thomas, thanks for your responses! But my client only uses the old Ruby version. Is there a way to convert our string to Unicode so that I can use this pattern: [\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]? I tried the Iconv library, ::Iconv.conv('UTF-8//IGNORE', 'UTF-8', 'لستتتثييي'), which gives the following output: "\331\204\330\263\330\252\330\252\330\252\330\253\331\212\331\212\331\212" – Sivananda Jan 12 '15 at 14:01
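
The octal escapes in that Iconv output are the raw UTF-8 bytes of the Arabic text, so the string is already UTF-8; the limitation is that the 1.8.7 regexp engine does not understand \u or \p{}. One possible workaround is to match the byte patterns directly. The following is a minimal sketch, not a tested 1.8.7 solution: the helper name `arabic_bytes?` and the constant are hypothetical, only the U+0600..U+06FF and U+0750..U+077F ranges are covered, and the claim that a 1.8.7 String can be matched as plain bytes is an assumption (the code was written against modern Ruby, where `String#b` supplies the byte view).

```ruby
# Sketch only: byte-level match for UTF-8 text, needing no \u or \p{} support.
# U+0600..U+06FF encodes as a lead byte 0xD8..0xDB plus one continuation
# byte 0x80..0xBF; U+0750..U+077F encodes as 0xDD plus 0x90..0xBF.
# The /n flag makes the regexp operate on raw bytes.
ARABIC_UTF8_BYTES = /[\xD8-\xDB][\x80-\xBF]|\xDD[\x90-\xBF]/n

# Hypothetical helper name, not part of any library.
def arabic_bytes?(str)
  # String#b exists from Ruby 2.0. Assumption: on 1.8.7 a String is already
  # a plain byte sequence, so it can be matched directly.
  bytes = str.respond_to?(:b) ? str.b : str
  !!(bytes =~ ARABIC_UTF8_BYTES)
end

# The octal escapes are the first bytes of the Iconv output quoted above.
puts arabic_bytes?("\331\204\330\263\330\252")
puts arabic_bytes?("plain ASCII")
```

Because 0xD8..0xDD only ever appear as lead bytes in valid UTF-8, this pattern should not match in the middle of another character. The presentation-form ranges (U+FB50..U+FC3F, U+FE70..U+FEFC) use three-byte sequences starting with 0xEF and would need additional alternatives.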
1

Unicode ranges covering Arabic characters:

[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]

source: https://stackoverflow.com/a/11323651/3035830

Example:

arabic = "لأَبْجَدِيَّة العَرَبِيَّة - الحُرُوُفْ العَرَبِيَةُ"
#=> "لأَبْجَدِيَّة العَرَبِيَّة - الحُرُوُفْ العَرَبِيَةُ"
# each returns its receiver and discards the scan results, so scan the whole string instead
arabic.scan(/[\u0600-\u06ff\u0750-\u077f\ufb50-\ufc3f\ufe70-\ufefc]+/)
#=> ["لأَبْجَدِيَّة", "العَرَبِيَّة", "الحُرُوُفْ", "العَرَبِيَةُ"]

You can then use this pattern to check whether a piece of text is in Arabic.
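
As a rough illustration of such a check, here is a minimal sketch for Ruby 1.9 or later. The method names `contains_arabic?` and `only_arabic?` and the constants are hypothetical; the ranges are the ones listed in this answer, merged into a single character class. Treating whitespace as allowed in "Arabic-only" text is an assumption you may want to change.

```ruby
# Ranges from the answer, merged into one character class.
ARABIC_CHAR     = /[\u0600-\u06ff\u0750-\u077f\ufb50-\ufc3f\ufe70-\ufefc]/
NON_ARABIC_CHAR = /[^\u0600-\u06ff\u0750-\u077f\ufb50-\ufc3f\ufe70-\ufefc]/

# True if the text contains at least one character from the ranges.
def contains_arabic?(text)
  !!(text =~ ARABIC_CHAR)
end

# True if, ignoring whitespace, every character falls in the ranges.
# Empty or whitespace-only text counts as not Arabic.
def only_arabic?(text)
  stripped = text.gsub(/\s+/, '')
  !stripped.empty? && stripped !~ NON_ARABIC_CHAR
end

puts contains_arabic?("hello مرحبا")
puts only_arabic?("مرحبا بالعالم")
puts only_arabic?("hello مرحبا")
```

Note that punctuation such as "-" falls outside the ranges, so `only_arabic?` rejects text containing it; widen the whitespace filter if that is too strict for your input.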

shivam
  • I used the above regular expression, but it's not working: patt = /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/ => /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]/ 1.8.7-p376 :002 > str = "هْلِهِ وَجِيْرَانِهِ وَأَنْ يَبْذُلَ كُلَّ " 1.8.7-p376 :003 > str.match(patt) => nil – Sivananda Jan 12 '15 at 06:14
  • @Sivananda If you had already tried it, why didn't you mention that in your post? – Arup Rakshit Jan 12 '15 at 06:15
  • @Sivananda I updated the answer with an example. Can you check again? The character sets seem to work fine. – shivam Jan 12 '15 at 06:16
  • @muistooshort I tested the above example in irb; it gave the following output: ["\331\204\330\243\331\216\330\250\331\222\330\254\331\216\330\257\331\220\331\212\331\216\331\221\330\251", "\330\247\331\204\330\271\331\216\330\261\331\216\330\250\331\220\331\212\331\216\331\221\330\251", "-", "\330\247\331\204\330\255\331\217\330\261\331\217\331\210\331\217\331\201\331\222", "\330\247\331\204\330\271\331\216\330\261\331\216\330\250\331\220\331\212\331\216\330\251\331\217"] – Sivananda Jan 12 '15 at 06:26