
I have copied some data describing cholera cases in regions of Yemen from an online database into a text file. The name of each region is given in both English and Arabic in a single string. I would like to remove the Arabic in R and be left with just the English names.

This is what the English/Arabic string looks like when read into R:

regions <- c("Al Hudaydah الحديدة", "Hajjah حجة")

I would like to be left with just the English "Al Hudaydah" "Hajjah"

I have tried str_replace_all(regions, "[^[:alnum:]]", "") and replace_non_ascii(regions), but neither gives me what I'm looking for.

Any ideas?

Thanks!

2 Answers


The simplest approach may be to use gsub:

gsub("[^A-Za-z0-9 ]", "", regions)
Hugh

Edit: I have found the solution to my problem. The issue was in how the text file was read in: if it contains Arabic (or presumably any other non-Latin script), you need to use encoding = 'UTF-8'

e.g.

txt <- readLines("Arabic_English_script.txt") returns the Arabic mis-encoded as mojibake (the exact garbled characters depend on your platform and locale), e.g.

"Al Hudaydah Ø§Ù„Ø­Ø¯ÙŠØ¯Ø©" "Taizz ØªØ¹Ø²"

whereas txt <- readLines("Arabic_English_script.txt", encoding = 'UTF-8') returns

"Al Hudaydah الحديدة" "Taizz تعز"

Once the text has been properly imported, then gsub("[^[:alnum:]]", "", txt) returns

"AlHudaydah" "Taizz"

(Note, it still removes the spaces. Not sure how to fix that one.)

  • I *think* `"[^[:alnum:] ]"` will work (add a space after `[:alnum:]`: this means "remove all characters that are not alphanumeric **or a space**"). Note this is similar to @Hugh's answer, but slightly better because of the remote chance of using a locale where there are letters outside the range `A-Z`, e.g. Estonian. – Ben Bolker Mar 05 '21 at 15:42