
I have copied some data describing cholera cases in regions of Yemen from an online database into a text file. The name of each region is given in both English and Arabic in a single string. I would like to remove the Arabic in R and be left with just the English names.

This is what the English/Arabic string looks like when read into R:

regions <- c("Al Hudaydah الحديدة", "Hajjah حجة")

I would like to be left with just the English "Al Hudaydah" "Hajjah"

I have tried str_replace_all(regions, "[^[:alnum:]]", "") and replace_non_ascii(regions), but neither gives me what I'm looking for.

Any ideas?

Thanks!

2 Answers


The simplest approach may be to use gsub:

gsub("[^A-Za-z0-9 ]", "", regions)
Hugh

Edit: I have found the solution to my problem. The issue was in how the text file was read in: if it contains Arabic (or presumably any other non-Latin script), you need to use encoding = 'UTF-8'

e.g.

txt <- readLines("Arabic_English_script.txt") returns the Arabic mis-encoded as mojibake (the exact garbled characters depend on your platform and locale), e.g.

"Al Hudaydah Ø§Ù„Ø­Ø¯ÙŠØ¯Ø©" "Taizz ØªØ¹Ø²"

whereas txt <- readLines("Arabic_English_script.txt", encoding = 'UTF-8') returns

"Al Hudaydah الحديدة" "Taizz تعز"

Once the text has been properly imported, then gsub("[^[:alnum:]]", "", txt) returns

"AlHudaydah" "Taizz"

(Note, it still removes the spaces. Not sure how to fix that one.)

  • I *think* `"[^[:alnum:] ]"` will work (add a space after `[:alnum:]`: this means "remove all characters that are not alphanumeric **or a space**"). Note this is similar to @Hugh's answer, but slightly better because of the remote chance of using a locale where there are letters outside the range `A-Z`, e.g. Estonian. – Ben Bolker Mar 05 '21 at 15:42