I want to count the number of "letters" in text written in non-Western scripts such as Hindi. I put "letters" in quotation marks because, if I'm not mistaken, a character in Mandarin, for example, does not necessarily represent a letter but something closer to a word.

With Western languages, the following works:

library(stringr)
western_text <- "This is my text"
str_count(tolower(western_text), "[a-z]")

# [1] 12

Now I try the same with a Hindi response:

hindi_text <- "बहुत सी"
str_count(tolower(hindi_text), "[a-z]")

# [1] 0

So the question is: how can I count the letter equivalents in Hindi (and potentially in other non-Western scripts such as Mandarin, Cyrillic, ...)?

Update: I guess I will probably need to create some sort of lookup list of all non-Western alphabets to match against?

deschen

1 Answer

Hindi:

hindi_text <- "बहुत सी"
str_count(hindi_text)
[1] 5

Bulgarian

bulgarian_text <- "НаРаВЯне"
str_count(tolower(bulgarian_text))
[1] 8

Amharic

amharic_text <- "ጆሮ"
str_count(amharic_text)
[1] 2

Russian

russian_text <- "солнце"
str_count(russian_text)
[1] 6

Arabic

arabic_text <- "الله"
str_count(arabic_text)
[1] 4

Right?

To be safe, you can additionally pass your string through enc2utf8().


An addition:

russian_text <- "СОлнЦЕ3333 "
str_count(tolower(russian_text), "[а-я]")
[1] 6

A new addition:

hybrid_text <- "СОлнЦЕ3333 girl "
str_count(tolower(hybrid_text), c("[а-я]", "[a-z]"))
[1] 6 4
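
If hand-written ranges such as [а-я] feel brittle (that range, for example, does not include ё), the same per-script counts can probably be obtained from Unicode script properties instead. The following is only a minimal sketch, assuming stringr's ICU-based regex engine accepts the \p{Script=...} syntax; the name script_patterns is purely illustrative.

```r
library(stringr)

hybrid_text <- "СОлнЦЕ3333 girl "

# Illustrative vector: one Unicode script property per script of interest.
script_patterns <- c("\\p{Script=Cyrillic}", "\\p{Script=Latin}")

# Digits and spaces belong to the "Common" script, so neither pattern counts them.
# tolower() is not needed because a script property covers both cases.
script_counts <- str_count(hybrid_text, script_patterns)
script_counts
# [1] 6 4
```

Other scripts, e.g. "\\p{Script=Devanagari}", should be usable the same way.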
manro
  • Thanks for this hint. That's a good start. However, this counts all characters, not only letters, e.g. "солнце " gives 7 characters. And it also gives some incorrect results when adding numbers, e.g. "солнце0" still gives 6 as result, not 7. – deschen Dec 01 '21 at 21:44
  • @deschen See the addition. We should go back to [a-z], but for every language separately ) – manro Dec 01 '21 at 21:53
  • Ok, so it basically requires to specify the respective alphabet similar to the Western [a-z]. Which makes sense. I just hoped there is some shortcut to reference any/all alphabets in one command. But I see that this is a bit naive to hope. – deschen Dec 01 '21 at 21:56
  • @deschen Look to the next addition ;) – manro Dec 01 '21 at 22:00
  • Yes yes, that's what I meant, I need to specify the Western, Cyrillic, Japanese... alphabet. With "one command" I was referring more to the question of whether there is a function that gives me the info about letters in any language without having to specify the alphabet/letters. – deschen Dec 01 '21 at 22:12
  • @deschen Look at `nchar` or `stri_count` + regex to exclude digits and punctuation marks. – manro Dec 01 '21 at 22:20
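
The "one command" asked about in the comments can probably be approximated with the Unicode general category \p{L} ("any letter"), which stringr's ICU-based regex engine supports. What follows is a minimal sketch rather than a definitive answer; count_letters and count_letters_and_marks are hypothetical helper names. Note that Devanagari vowel signs are combining marks (\p{M}), not letters, so whether to count them is a judgment call.

```r
library(stringr)

# Hypothetical helper: count code points in the Unicode "Letter" category,
# regardless of script; digits, spaces and punctuation are ignored.
count_letters <- function(x) str_count(x, "\\p{L}")

count_letters("This is my text")
# [1] 12
count_letters("солнце0 ")
# [1] 6
count_letters("बहुत सी")
# [1] 4   (ब, ह, त, स; the vowel signs ु and ी are marks, not letters)

# Hypothetical variant that also counts combining marks such as vowel signs.
count_letters_and_marks <- function(x) str_count(x, "[\\p{L}\\p{M}]")

count_letters_and_marks("बहुत सी")
# [1] 6
```

Neither count matches the 5 returned by str_count(hindi_text) in the answer, which counts user-perceived characters including the space.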