Turn non-ASCII characters into their Unicode escape form

Question

I'm writing a package function where I want to check if some text contains any of the following characters:

äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ

Problem is that devtools::check() returns a warning:

W  checking R files for non-ASCII characters ...
   Found the following file with non-ASCII characters:
     gb_data_prepare.R
   Portable packages must use only ASCII characters in their R code,
   except perhaps in comments.
   Use \uxxxx escapes for other characters.

So I tried to convert these characters into Unicode escapes, but I don't really know how.

stringi::stri_encode("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", to = "Unicode")

Error in stringi::stri_encode(x, to = "Unicode") : 
  embedded nul in string: '\\xff\\xfe\\xe4'

doesn't work. Same with

iconv("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", from = "UTF-8", to = "Unicode")

Error in iconv(x, from = "UTF-8", to = "Unicode") : 
  unsupported conversion from 'UTF-8' to 'Unicode' in codepage 1252

Any ideas what I can do?

Note: another odd thing is that if I run:

x <- "äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ"

x now returns "äöüßâçèéêîôûaecslnózzžšueráúëïùÄÖÜSSÂÇÈÉÊÎÔÛAECSLNÓZZŽŠUERÁÚËÏÙ" which is wrong. So I guess it also has something to do with my general R encoding?

There is no "Unicode" encoding. `UTF-8`, `UTF-16`, and `UTF-32` are encodings that use different numbers of bytes to store Unicode code points. Try `cat(stringi::stri_escape_unicode("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ"))` to get the escaped version of the string, and include that version in your package. Are you on Windows? Windows uses LATIN-1 encoding by default rather than UTF-8 — MrFlick, Mar 30 '22 at 20:58
Yes, I'm on Windows. Your code gives me the escaped versions of most of the characters, but not all of them; it returns `\u00e4\u00f6\u00fc\u00df\u00e2\u00e7\u00e8\u00e9\u00ea\u00ee\u00f4\u00fbaecsln\u00f3zz\u017e\u0161uer\u00e1\u00fa\u00eb\u00ef\u00f9\u00c4\u00d6\u00dcSS\u00c2\u00c7\u00c8\u00c9\u00ca\u00ce\u00d4\u00dbAECSLN\u00d3ZZ\u017d\u0160UER\u00c1\u00da\u00cb\u00cf\u00d9`. The 13th character, for example, comes back as "a" rather than as an escape for "ą". Sorry, this whole encoding topic is a nightmare to me... — deschen, Mar 30 '22 at 21:05
This seems like a windows issue. If you are copy/pasting values the OS is probably trying to convert to LATIN-1. I think if you use `s <- "\u00e4\u00f6\u00fc\u00df\u00e2\u00e7\u00e8\u00e9\u00ea\u00ee\u00f4\u00fb\u0105\u0119\u0107\u015b\u0142\u0144\u00f3\u017c\u017a\u017e\u0161\u016f\u011b\u0159\u00e1\u00fa\u00eb\u00ef\u00f9\u00c4\u00d6\u00dcSS\u00c2\u00c7\u00c8\u00c9\u00ca\u00ce\u00d4\u00db\u0104\u0118\u0106\u015a\u0141\u0143\u00d3\u017b\u0179\u017d\u0160\u016e\u011a\u0158\u00c1\u00da\u00cb\u00cf\u00d9"` you will get the string you want. — MrFlick, Mar 30 '22 at 21:10
Thanks! I'm curious, do you get this result with your stri_escape_unicode code? Or did you manually look up the unicode representation? And any idea if there's a way to prevent the OS issue of Windows converting to latin1? — deschen, Mar 30 '22 at 21:13
I read in the string from a file, specifying the correct encoding. Then I used `stri_escape_unicode`. The easiest way to always use Unicode is to switch to a Mac or Linux, I'm afraid. — MrFlick, Mar 30 '22 at 21:15
`cat(stringi::stri_escape_unicode(x))` works as expected for me on Windows 10 providing that `stringi::stri_enc_get()` returns `"UTF-8"`; I have set [*Beta: Use Unicode UTF-8 for worldwide language support*](https://stackoverflow.com/questions/56419639/)… (`R version 4.1.0`, Platform: `x86_64-w64-mingw32/x64 (64-bit)`). — JosefZ, Mar 31 '22 at 15:13
Both of these settings are set, i.e. `stri_enc_get()` returns `"UTF-8"` and the beta Unicode support in Windows is also enabled. — deschen, Mar 31 '22 at 17:45
Could it be related to the locale settings? Mine are `"LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"`. — deschen, Mar 31 '22 at 17:47
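
Pulling the comments together, here is a minimal sketch of how the check could look with an ASCII-only source file. It reuses the escaped string from MrFlick's comment. The names `special_chars`, `has_special`, and `escape_non_ascii` are hypothetical and not from the thread, and the base R escaper is only an assumed stand-in for `stringi::stri_escape_unicode` that handles code points up to U+FFFF. It has not been tested on a Windows-1252 locale.

```r
# The character list from the question, written with \u escapes so the
# R source file itself contains only ASCII (string taken from the comments).
# Note: the literal "SS" is part of the original list, so a plain capital S
# also counts as a match; drop it if that is not intended.
special_chars <- "\u00e4\u00f6\u00fc\u00df\u00e2\u00e7\u00e8\u00e9\u00ea\u00ee\u00f4\u00fb\u0105\u0119\u0107\u015b\u0142\u0144\u00f3\u017c\u017a\u017e\u0161\u016f\u011b\u0159\u00e1\u00fa\u00eb\u00ef\u00f9\u00c4\u00d6\u00dcSS\u00c2\u00c7\u00c8\u00c9\u00ca\u00ce\u00d4\u00db\u0104\u0118\u0106\u015a\u0141\u0143\u00d3\u017b\u0179\u017d\u0160\u016e\u011a\u0158\u00c1\u00da\u00cb\u00cf\u00d9"

# Hypothetical helper: TRUE for each element of `text` that contains at
# least one of the characters above.
has_special <- function(text) {
  pattern <- paste0("[", special_chars, "]")
  grepl(pattern, text, perl = TRUE)
}

# Hypothetical helper: replace every non-ASCII character of a single string
# with its \uXXXX escape, using base R only. Assumes all characters are in
# the Basic Multilingual Plane (code points <= 0xFFFF).
escape_non_ascii <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))
  out <- ifelse(cps > 127L,
                sprintf("\\u%04x", cps),
                intToUtf8(cps, multiple = TRUE))
  paste(out, collapse = "")
}

has_special(c("Stra\u00dfe", "hello"))      # TRUE FALSE
cat(escape_non_ascii("\u0105\u0119"), "\n") # \u0105\u0119

# To locate any remaining non-ASCII characters in a package file:
# tools::showNonASCIIfile("R/gb_data_prepare.R")
```

If the string is typed or pasted into a session that has already converted "ą" to "a", no escaping function can recover it, which matches the output reported above. Reading the characters from a UTF-8 file with the encoding stated explicitly, as MrFlick describes, avoids that step.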
