Count number of "letters"/characters for non-Western language (e.g. Hindi)

Question

I want to count the number of "letters" in non-Western languages like Hindi. I put "letters" in quotation marks because, if I'm not mistaken, a character in Mandarin, for example, does not necessarily represent a letter but something more like a word.

With Western languages, the following works:

library(stringr)
western_text <- "This is my text"
str_count(tolower(western_text), "[a-z]")

# [1] 12

Now I try the same with a Hindi response:

hindi_text <- "बहुत सी"
str_count(tolower(hindi_text), "[a-z]")

# [1] 0

So the question is: how can I count the letter equivalents in Hindi (and potentially other non-Western scripts such as Mandarin, Cyrillic, ...)?

Update: I guess I will probably need to create some sort of lookup list of all non-Western alphabets to match against?

Good source. Thanks for finding it! – deschen Dec 01 '21 at 22:15

manro · Answer 1 · 2021-12-01T22:01:05.370

Hindi:

library(stringr)
hindi_text <- "बहुत सी"
str_count(hindi_text)  # no pattern given: counts characters, including the space
# [1] 5

Bulgarian:

bulgarian_text <- "НаРаВЯне"
str_count(tolower(bulgarian_text))
# [1] 8

Amharic:

amharic_text <- "ጆሮ"
str_count(amharic_text)
# [1] 2

Russian:

russian_text <- "солнце"
str_count(russian_text)
# [1] 6

Arabic:

arabic_text <- "الله"
str_count(arabic_text)
# [1] 4

Right?

To be safe, you can additionally pass your string through enc2utf8.
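
As a rough illustration of what that conversion does, the sketch below builds a Latin-1 string on purpose and converts it. The sample word and the variable names are made up for this example; only base R is used.

```r
# Hypothetical input: a string deliberately stored as Latin-1
x_latin1 <- iconv("caf\u00e9", from = "UTF-8", to = "latin1")
Encoding(x_latin1)                 # "latin1"

# enc2utf8() re-encodes the string and marks it as UTF-8
x_utf8 <- enc2utf8(x_latin1)
Encoding(x_utf8)                   # "UTF-8"

# The byte length changes, the number of characters does not
nchar(x_latin1, type = "bytes")    # 4
nchar(x_utf8, type = "bytes")      # 5
nchar(x_utf8, type = "chars")      # 4
```

Strings that are already UTF-8 pass through unchanged, so calling enc2utf8 before counting should be harmless.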

An addition:

russian_text <- "СОлнЦЕ3333 "
str_count(tolower(russian_text), "[а-я]")
# [1] 6

A new addition:

hybrid_text <- "СОлнЦЕ3333 girl "
str_count(tolower(hybrid_text), c("[а-я]", "[a-z]"))
# [1] 6 4
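
A possible shortcut, not part of the original answer: stringr's regular expressions are handled by ICU, which understands Unicode property classes. The sketch below assumes that `\p{L}` (any letter), `\p{M}` (combining marks) and `\p{Script=...}` behave as ICU documents them. The variable names are illustrative only.

```r
library(stringr)

hindi_text  <- "बहुत सी"
hybrid_text <- "СОлнЦЕ3333 girl "

# \p{L} matches any Unicode letter; digits, spaces and punctuation are skipped
letters_any <- str_count(hybrid_text, "\\p{L}")
# [1] 10

# Script properties count one alphabet at a time, without tolower()
letters_cyrillic <- str_count(hybrid_text, "\\p{Script=Cyrillic}")
letters_latin    <- str_count(hybrid_text, "\\p{Script=Latin}")
# 6 and 4

# Devanagari vowel signs are combining marks (\p{M}), not letters (\p{L})
hindi_letters       <- str_count(hindi_text, "\\p{L}")
# [1] 4
hindi_letters_marks <- str_count(hindi_text, "[\\p{L}\\p{M}]")
# [1] 6
```

Whether vowel signs should count as "letters" depends on the analysis, so the last two lines give both variants. Unlike a range such as [а-я], the property classes also cover letters that fall outside the range, for example ё.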

edited Dec 01 '21 at 22:01

answered Dec 01 '21 at 21:20

manro


Thanks for this hint. That's a good start. However, this indeed counts all characters, not only letters, e.g. "солнце " gives 7 characters. And it also gives some incorrect results when adding numbers, e.g. "солнце0" still gives 6 as result, not 7. – deschen Dec 01 '21 at 21:44
@deschen Look at the addition. We should match [a-z], but for every language separately :) – manro Dec 01 '21 at 21:53
Ok, so it basically requires specifying the respective alphabet, similar to the Western [a-z]. Which makes sense. I just hoped there was some shortcut to reference any/all alphabets in one command. But I see that this is a bit naive to hope. – deschen Dec 01 '21 at 21:56
@deschen Look at the next addition ;) – manro Dec 01 '21 at 22:00
Yes yes, that's what I meant: I need to specify the Western, Cyrillic, Japanese... alphabet. With "one command" I was referring more to the question of whether there is a function that gives me the info about letters in any language without having to specify the alphabet/letters. – deschen Dec 01 '21 at 22:12
@deschen Look at ```nchar``` or ```stri_count``` plus a regex to exclude digits and punctuation marks. – manro Dec 01 '21 at 22:20
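
One way the last hint could be read, as a minimal sketch rather than the commenter's actual code: remove digits, punctuation and whitespace, then count what remains. The sample string is invented, and the character classes (`\p{N}` for numbers, `\p{P}` for punctuation) are an assumption about what "exclude" should cover.

```r
library(stringi)

x <- "солнце0, 33!"

# Option 1: strip digits, punctuation and whitespace, then count with nchar()
cleaned <- stri_replace_all_regex(x, "[\\p{N}\\p{P}\\s]", "")
n_base  <- nchar(cleaned)
# [1] 6

# Option 2: count the characters that are none of those directly
n_stringi <- stri_count_regex(x, "[^\\p{N}\\p{P}\\s]")
# [1] 6
```

Symbols such as "+" or "€" belong to neither class and would still be counted, so this exclusion approach is looser than matching letters directly.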
