I would like to rename or combine different speakers' names so that each speaker appears under a single name. I have a variable called "speaker" that contains several speakers' names written with Lithuanian characters. When I try to merge observations under one name, it does not work if the name contains Lithuanian characters. I suspect these characters are the problem, because the same approach works for names without them.

For example:

lithu_comb[lithu_comb$speaker == "Č. Juršėnas L Ų", ] <- "Č. Juršėnas"

lithu_comb <- lithu_comb[!(lithu_comb$speaker=="Ąž  Tė. T S  Ąžė K Ų    Ū  Ū.S  Ąžė  Ū  Į Ką  Ū       Žū Ė J     . Są  Įų Į  Ė   S  Ąš Į  Ų Ųų"), ]

In the first line, I try to combine the observations because they belong to the same speaker, but the name is written incorrectly in some rows. In the second line, I try to drop the observations because the value is not a real speaker name.

The code fails in both cases, but it works for names without Lithuanian characters.

Thank you very much for any feedback or advice, and sorry in advance if I made any mistake in the post.

Alberto


1 Answer

Solution: Update R to version 4.2.0 or later.

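Older R versions on Windows do not use UTF-8 as the native encoding, so they handle many non-ASCII characters poorly. Starting with version 4.2.0, R on Windows uses UTF-8 as its native encoding (on recent Windows 10 builds and later), which should resolve most of these problems.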

With that version, this code runs fine on my Windows machine:

lithu_comb <- data.frame(speaker = c("Č. Juršėnas L Ų", "Č. Juršėnas"))
lithu_comb[lithu_comb$speaker == "Č. Juršėnas L Ų", ] <- "Č. Juršėnas"

output:

      speaker
1 Č. Juršėnas
2 Č. Juršėnas

Session information:

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
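
A side note that is independent of encoding: assigning with `df[cond, ] <- value` overwrites every column in the matching rows, not just `speaker`. If the real data has more columns than this example, a minimal sketch that touches only the `speaker` column could look like the following. The `text` column and the short junk value are made up for illustration and are not taken from the asker's data.

```r
# Toy data; the `text` column is hypothetical and only shows that
# other columns are left untouched.
lithu_comb <- data.frame(
  speaker = c("Č. Juršėnas L Ų", "Č. Juršėnas", "Ąž  Tė. T S"),
  text    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Recode only the speaker column instead of the whole row.
lithu_comb$speaker[lithu_comb$speaker == "Č. Juršėnas L Ų"] <- "Č. Juršėnas"

# Drop the rows whose speaker value is not a real name.
lithu_comb <- lithu_comb[lithu_comb$speaker != "Ąž  Tė. T S", ]

# Check the result: one row per distinct speaker name.
data.frame(table(lithu_comb$speaker))
```

If the comparison still matches nothing, the stored name probably differs from the typed literal in some way that is not visible when printed.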

Let us know if that solved your problem. If not, please share your session information:

sessionInfo()
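
Besides `sessionInfo()`, a few base R calls can show more directly whether the session is running in UTF-8 and how individual strings are marked. This is only a sketch of what to look at; the `speaker` vector below is a stand-in for the real column, written with `\u` escapes so that it does not depend on the encoding of the script file.

```r
# Does the current session use UTF-8 as its native encoding?
info <- l10n_info()
info[["UTF-8"]]

# The character-type locale, which is also listed by sessionInfo().
Sys.getlocale("LC_CTYPE")

# Stand-in for the real column (assumption: the real data is a character vector).
speaker <- c("\u010c. Jur\u0161\u0117nas", "A. Kubilius")

# How each string is marked; pure ASCII strings are always "unknown".
Encoding(speaker)

# Convert to UTF-8 explicitly before comparing, in case the marks differ.
speaker_utf8 <- enc2utf8(speaker)
```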
Leon Samson
  • I tried it after updating R. It still works for names without Lithuanian characters, but not for names like the one in the example. The code ran without errors, but when I then ran `data.frame(table(lithu_comb$speaker))` to check the changes, the names still appeared as different speakers. Any other thoughts? My session information: `R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043)` – Alberto Sep 30 '22 at 17:43
  • Can you share the information about your locale? It is part of the output of `sessionInfo()`. Does it say you are using UTF-8? Also, do you have problems with the minimal example that I used? The code below works for me, so I cannot reproduce your problem: `lithu_comb <- data.frame(speaker = c("Č. Juršėnas L Ų", "Č. Juršėnas"))` `lithu_comb[lithu_comb$speaker == "Č. Juršėnas L Ų", ] <- "Č. Juršėnas"` `data.frame(table(lithu_comb$speaker))` – Leon Samson Sep 30 '22 at 18:54
  • This is my full `sessionInfo()` output: `R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043) Matrix products: default locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C [5] LC_TIME=Spanish_Spain.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] utf8_1.2.2 readr_2.1.2 stringi_1.7.8 data.table_1.14.2 dplyr_1.0.10` – Alberto Oct 02 '22 at 08:23
  • Here is the full sequence of my code, in case you can spot any other issue: `#load data` `lithu_comb[lithu_comb$speaker == "A. Kubiliu", ] <- "Andrius Kubilius"` `lithu_comb[lithu_comb$speaker == "D. A. Barakauskas A", ] <- "D. A. Barakauskas"` `lithu_comb[lithu_comb$speaker == "Č. Stankevičius", ] <- "Č. V. Stankevičius"` `lithu_comb[lithu_comb$speaker == "Č. Juršėnas L Ų", ] <- "Č. Juršėnas"` `data.frame(table(lithu_comb$speaker))` – Alberto Oct 02 '22 at 08:27
  • When I run this code, the first two replacements work, and I see a single name with the combined number of observations. But the last two names, which contain Lithuanian characters, still appear as two different speakers each. I hope this helps to clarify the problem. – Alberto Oct 02 '22 at 08:27
  • It would be easier for us to help if you edited your post to include a [minimal reproducible example](https://stackoverflow.com/a/5963610/11856430). Show a small dataset with its summary output as printed in the console, your filtering code, and your output, including which names are filtered properly and which are not. Thank you! :) Furthermore, I think your locale is still set to [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) and not UTF-8, but with the information you have given me I am not yet sure whether that is the problem. – Leon Samson Oct 07 '22 at 08:09
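
The thread ends without a confirmed cause. One possibility, which is an assumption and was not verified by the asker, is that the names stored in the data contain different code points than the literals typed in the script, for example letters built from a base letter plus a combining accent. Such strings print identically but do not compare equal. A minimal sketch for checking this in base R follows; `code_points()` is a hypothetical helper and both example strings are constructed for illustration.

```r
# Two strings that print the same but are built from different code points.
typed  <- "Jur\u0161\u0117nas"     # precomposed letters U+0161 and U+0117
stored <- "Jurs\u030ce\u0307nas"   # s + combining caron, e + combining dot above

# `==` compares the underlying code points, not the printed appearance.
typed == stored

# Hypothetical helper: list the code points of a single string as U+XXXX labels.
code_points <- function(x) sprintf("U+%04X", utf8ToInt(enc2utf8(x)))

code_points(typed)
code_points(stored)
```

Running `code_points()` on a value copied from `unique(lithu_comb$speaker)` and on the literal used in the comparison would show whether the two really contain the same characters. If they differ only in composition, the stringi package that the asker already has attached provides Unicode normalization (for example `stri_trans_nfc()`), which could be applied to the column before comparing.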