I am trying to tokenize one Arabic sentence, Verse 38:1 of the Quran, with the tm and tokenizers packages, but they split it differently, into 3 and 4 words, respectively. Can someone explain (1) why this happens and (2) what this difference means from NLP and Arabic-language points of view? Also, is one of them wrong? I am by no means an expert in NLP or Arabic, but I am trying to run the code.
Here is the code I tried:
library(tm)
library(tokenizers)
# Verse 38:1
verse <- "ص والقرآن ذي الذكر"
# tm splits the verse into 3 words
a <- colnames(DocumentTermMatrix(Corpus(VectorSource(verse))))
a
# "الذكر" "ذي" "والقرآن"
# tokenizers splits the verse into 4 words
b <- tokenizers::tokenize_words(verse)
b
# "ص" "والقرآن" "ذي" "الذكر"
I would expect the two tokenizations to match, but they do not. What exactly is going on here?