How to handle example data in R Package that has UTF-8 marked strings

Question

I would like to include an example dataset (of Twitter tweets and metadata) in an R Package I'm writing.

I downloaded an example data.frame using the Twitter API and saved it as .RData (with the corresponding .R data description file) in my package.

When I run R CMD check, I get the following NOTE:

 * checking data for non-ASCII characters ... NOTE
 Note: found 287 marked UTF-8 strings

I tried saving the data.frame with ascii = TRUE, hoping this would fix the problem, but the NOTE persists. Any idea how I can get R CMD check to run without notes?

(Also, I would be open to removing all UTF-8 marked strings from the example data if that's the solution.) Thank you!

example row from data.frame:

First time in SF (@ San Francisco International Airport (SFO) - @flysfo in San Francisco, CA) https://t.co/1245xqxtwesr
  favorited favoriteCount replyToSN             created truncated replyToSID                 id replyToUID
1     FALSE             0      <NA> 2015-03-13 23:30:35     FALSE       <NA> 576525795927179264       <NA>
                                                   statusSource screenName retweetCount isRetweet retweeted
1 <a href="http://foursquare.com" rel="nofollow">Foursquare</a>  my_name93            0     FALSE     FALSE
      longitude    latitude
1 -122.38100052 37.61865062

It looks like you need to paste `"/@href"` onto your XPath query, or call `XML::xmlGetAttr(a, "href")` on the node `a`. Using `as(statusSource, "character")` may also work. But can we see the code you used to get the original data? - Rich Scriven, Mar 14 '15 at 06:33

score 12 · Accepted Answer · edited Feb 08 '21 at 15:35

In case it's useful to anyone in the future, the resolution I found is this:

The UTF-8 marked strings were in the dataset because tweets sometimes include emoji.

The advice I was given is that there isn't a straightforward way to get rid of the NOTE from R CMD check other than removing all of the UTF-8 marked strings.

To do this, I used the command:

nonUTF <- iconv(df$TroubleVector, from="UTF-8", to="ASCII")

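on the vector that contained emoji and similar characters. The command returns NA for any element that cannot be converted to ASCII, which here means any string containing non-ASCII characters. I dropped the rows where the result was NA, and now I get a clean build.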
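A minimal sketch of that subsetting step, for anyone who wants to see it end to end. The `tweets` data frame and its columns are made up for illustration, and the NA behaviour of `iconv()` is what I see on Linux; some platforms' iconv implementations may substitute a character instead of returning NA, so it is worth checking the result.

```r
# Hypothetical example data; `tweets` and its columns are illustrative only.
tweets <- data.frame(
  text = c("First time in SF", "Great day \U0001F600", "caf\u00e9 stop"),
  retweetCount = c(0L, 2L, 1L),
  stringsAsFactors = FALSE
)

# iconv() gives NA for elements that cannot be represented in ASCII
# (assumption: the platform's iconv does not substitute silently).
ascii_text <- iconv(tweets$text, from = "UTF-8", to = "ASCII")

# Keep only the rows whose text converted cleanly.
tweets_ascii <- tweets[!is.na(ascii_text), , drop = FALSE]

# Sanity check: no remaining string should carry the UTF-8 mark.
has_marked_utf8 <- any(Encoding(tweets_ascii$text) == "UTF-8")
```

After this, `tweets_ascii` is what would be saved into the package's data directory. Note that this drops whole rows, so it only makes sense if losing those examples is acceptable.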

score 0 · Answer 2 · stevec · 2020-12-18T02:45:53.277

I did a Google search for an online UTF-8 to ASCII converter, pasted in my code, converted it, and pasted it back into my script.

Since this answer was downvoted, it drew my attention to the fact that the site I originally linked to performed poorly, so I removed that specific link from the answer. If you stumble upon a bad one, use another one, as there are many available in the top Google results.

edited Dec 18 '20 at 02:45

answered May 25 '19 at 17:24

stevec


score 0 · Answer 3 · answered Mar 24 '21 at 08:01

stringi::stri_enc_toascii() from the stringi package solved the problem in my package development.

> stringi::stri_enc_isascii(a)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> b <- stringi::stri_enc_toascii(a)
There were 50 or more warnings (use warnings() to see the first 50)
> stringi::stri_enc_isascii(b)
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [22] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [64] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[148] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
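For comparison, here is a rough base R counterpart that avoids the extra dependency. It is not what I used above: it swaps in `iconv()` with its `sub` argument, which replaces each non-convertible byte rather than dropping the element. The helper name `to_ascii_df` and the sample data are hypothetical.

```r
# Hypothetical helper, not part of any package: replace non-ASCII bytes in
# every character column of a data frame, leaving other columns untouched.
to_ascii_df <- function(df, sub = "?") {
  is_chr <- vapply(df, is.character, logical(1))
  # `sub` is substituted for each byte that has no ASCII representation.
  df[is_chr] <- lapply(df[is_chr], iconv, from = "UTF-8", to = "ASCII", sub = sub)
  df
}

# Illustrative data only.
example_df <- data.frame(
  text = c("plain ascii", "caf\u00e9 stop"),
  n = c(1L, 2L),
  stringsAsFactors = FALSE
)

cleaned <- to_ascii_df(example_df)

# Every code point in the cleaned column should now be below 128.
all_ascii <- all(utf8ToInt(paste(cleaned$text, collapse = "")) < 128L)
```

Because the replacement is per byte, a single accented letter or emoji can turn into several substitute characters, so the cleaned text is mainly useful as example data rather than for display.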
