
I am doing text mining on Arabic text in R. When I use the `gsub` function, I get the following error:

Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) : 
  invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
  PCRE pattern compilation error
        'character value in \x{} or \o{} is too large'
        at '}\x{0644}(?=\p{L})'

Here is my code:

x<-("الوطن")
# Remove a leading alef-lam when it is followed by a letter
m <- gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)

Can anyone help me?

Reem

2 Answers

2

I finally solved the problem. The error appears when I import Arabic data from a CSV file and then apply `gsub`:

    Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) : 
   invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
   In addition: Warning message:
   In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
   PCRE pattern compilation error
        'character value in \x{} or \o{} is too large'
        at '}\x{0644}(?=\p{L})'

I figured out that I need to save the data with `fileEncoding = "UTF-8"`, read it back with `encoding = "UTF-8"`, and then change the locale, like this:

> Sys.setlocale("LC_CTYPE", "arabic")
[1] "Arabic_Saudi Arabia.1256"
> write.csv(x, file = "x.csv", fileEncoding = "UTF-8")
> y <- read.csv("C:/Users/Documents/x.csv", encoding = "UTF-8")
> Sys.setlocale("LC_CTYPE", "arabic")
[1] "Arabic_Saudi Arabia.1256"
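
A minimal sketch of the same round trip, assuming the root cause is that the imported strings are not marked as UTF-8. The temporary file and the column name `word` are hypothetical. It writes the UTF-8 bytes with `writeLines(..., useBytes = TRUE)` instead of `write.csv`, so the sketch does not depend on the session locale, and it checks `Encoding()` before calling `gsub`:

```r
# Hypothetical data: one column named "word" holding the word from the question.
word <- "\u0627\u0644\u0648\u0637\u0646"  # the Arabic word, written with escapes
path <- tempfile(fileext = ".csv")

# Write the raw UTF-8 bytes so the file content does not depend on the locale.
writeLines(c("word", enc2utf8(word)), path, useBytes = TRUE)

# encoding = "UTF-8" declares the imported strings to be UTF-8.
imported <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
imported_encoding <- Encoding(imported$word)

# With input marked as UTF-8, the \x{...} escapes in the pattern compile.
stripped <- gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", imported$word, perl = TRUE)
unlink(path)
```

If `Encoding()` reports `"unknown"` for imported Arabic text, the strings are being treated as native-encoding bytes, which is the situation in which the PCRE error above is likely to appear.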
Reem
1

It seems to me that the only problem is your quotation marks:

> x <- "الوطن"
> gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
[1] "وطن"

Also, check your OS locale: I have run into similar issues when processing Hebrew text while my Windows locale was set to US.
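
As a rough sketch of that locale check, base R can report both the character-type locale and whether the session is running in a UTF-8 locale. The variable names here are only illustrative:

```r
# Character-type locale of the current session, e.g. "English_United States.1252"
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the locale is UTF-8.
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
```

A code page such as 1252 in the output means non-Latin text is not representable in the native encoding, so strings need to be explicitly marked as UTF-8.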

Spätzle
  • Please run `Sys.getlocale()` and report the output. – Spätzle Dec 25 '18 at 10:29
  • `> Sys.getlocale()` `[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"` – Reem Dec 25 '18 at 11:24
  • As I said, the problem begins with your OS. You need to change the locale so that Arabic is supported (Windows 10: Settings > Time & Language > Region & language, then select your country), and then try again. – Spätzle Dec 25 '18 at 11:52
  • Thank you for helping me, but I still get the error after changing the locale to Arabic. Is there another way to solve this problem? – Reem Dec 25 '18 at 12:28
  • See this: https://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r – Spätzle Dec 25 '18 at 12:33
  • `gsub` works when I use the variable, but when I import the data from a CSV file and apply `gsub`, I get the same error. Why? – Reem Dec 25 '18 at 12:37
  • Try using the `enc2utf8` function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html – Spätzle Dec 25 '18 at 12:39
  • Thank you so much @Spätzle for helping me. – Reem Dec 25 '18 at 13:59
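
The `enc2utf8` suggestion from the comments might look like the sketch below. It assumes the input can be converted to UTF-8 from its current encoding, and the variable names are illustrative only:

```r
# The word from the question, written with \u escapes so the script is pure ASCII.
x <- "\u0627\u0644\u0648\u0637\u0646"

# enc2utf8() converts the string to UTF-8 and marks it as such.
x_utf8 <- enc2utf8(x)
x_encoding <- Encoding(x_utf8)

# Same pattern as in the question, applied to the UTF-8 marked input.
result <- gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x_utf8, perl = TRUE)
```

Note that `enc2utf8` can only help if the bytes were decoded correctly in the first place. For data read from a file, declaring the encoding at import time is usually the more reliable step.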