I am preparing a dataset that contains CJK characters in R, mostly with the tidyverse. During the process, I found that some character elements have \037 at the very end.

# A tibble: 99 × 2
     Prefecture     n
            <chr> <int>
1            \037     1
2      北海道\037     1
3          北海道    13
4          北海道     4
...          ...     ...

I have tried to remove them with the line below:

library(stringr)
out.file %>% mutate(
    Prefecture = str_replace_all(out.file$Prefecture, "\\\\037", "")
)

str_replace_all does remove all the \037s when tested on a single string. When applied to an entire column through mutate, however, the lines above still give the same result as the first code chunk in this post.

What would be the most efficient way to remove them from strings?

Update with solution

require(stringi)
out.file %>% 
mutate(Prefecture = stri_escape_unicode(Prefecture), 
       Prefecture = str_replace_all(Prefecture, "\037", ""),
       Prefecture = stri_unescape_unicode(Prefecture))

This way I was able to resolve the issue.
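
For comparison, here is a minimal sketch of a more direct route. It rests on an assumption: that the \037 shown in the tibble is the single control character U+001F (octal 037, the ASCII unit separator) rather than a literal backslash followed by 037. If that holds, the character can be matched as a fixed string with no regex escaping. The small tibble below is a hypothetical stand-in for out.file, and cleaned is an illustrative name.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for out.file (assumption, not the real data).
# "\037" in an R string literal is the single control character U+001F.
out.file <- tibble(Prefecture = c("\037", "北海道\037", "北海道"))

# Match the control character itself as a fixed string, refer to the
# column by its bare name inside mutate, and assign the result:
# mutate returns a new tibble and does not modify out.file in place.
cleaned <- out.file %>%
  mutate(Prefecture = str_replace_all(Prefecture, fixed("\037"), ""))

# Check whether any unit separators remain.
any(str_detect(cleaned$Prefecture, fixed("\037")))
```

If the column really holds a literal backslash followed by 037, this sketch will not match anything, and the escaped pattern from the original attempt would be the one to use.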

Carl H
  • This may help: http://stackoverflow.com/a/25466734/1000343 – Tyler Rinker Apr 07 '17 at 17:49
  • Thanks, @TylerRinker! That was a helpful post. I was able to escape the CJK characters, replace the unwanted characters, and unescape them all. This solves my issue. – Carl H Apr 07 '17 at 18:39

0 Answers0