dplyr filter condition to distinguish between unicode symbol and its unicode representation

Question

I am trying to filter rows by whether the value in the Symbol column is printed in the form \uxxxx.

This is easy to do visually: some values look like $, ¢, £, and others like \u058f, \u060b, \u07fe.

But I cannot seem to figure it out using stringi / dplyr.

library(dplyr)
library(stringi)

df <- structure(list(Character = c("\\u0024", "\\u00A2", "\\u00A3", 
                             "\\u00A4", "\\u00A5", "\\u058F", "\\u060B", "\\u07FE", "\\u07FF", 
                             "\\u09F2", "\\u09F3", "\\u09FB", "\\u0AF1", "\\u0BF9", "\\u0E3F", 
                             "\\u17DB", "\\u20A0", "\\u20A1", "\\u20A2", "\\u20A3"), 
                     Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff", 
                                "৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠", 
                                "₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")

   Character Symbol
1    \\u0024      $
2    \\u00A2      ¢
3    \\u00A3      £
4    \\u00A4      ¤
5    \\u00A5      ¥
6    \\u058F \u058f
7    \\u060B \u060b
8    \\u07FE \u07fe
9    \\u07FF \u07ff
10   \\u09F2      ৲
11   \\u09F3      ৳
12   \\u09FB \u09fb
13   \\u0AF1 \u0af1
14   \\u0BF9 \u0bf9
15   \\u0E3F      ฿
16   \\u17DB      ៛
17   \\u20A0      ₠
18   \\u20A1      ₡
19   \\u20A2      ₢
20   \\u20A3      ₣

What I've tried

I have tried variations on nchar, without luck:


df$Symbol %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

df$Symbol %>% stri_unescape_unicode %>% nchar
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

df$Symbol %>% stri_escape_unicode %>% nchar
# [1] 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Question

How can I filter on the Symbol column for all the rows of the form $, ¢, £ etc (and conversely for rows like \u058f, \u060b, \u07fe)?

@vpz I haven't, no. I reasoned there would be some 'more formal' way of doing it, but will gladly use regex if it works reliably! — stevec, Mar 19 '20 at 01:32
Does the character representation have some pattern for the symbols? — vpz, Mar 19 '20 at 01:41
@vpz the only info is what's contained in the `Symbol` column (I feel like it *should* be enough, but I can't work out how to distinguish - which is interesting because it's so easy for human eyes to see) — stevec, Mar 19 '20 at 01:43
Are all those valid unicodes? You can try the solution here - https://stackoverflow.com/questions/30794201/search-for-unicode-values-in-character-string — Ronak Shah, Mar 19 '20 at 02:02
@RonakShah thanks I’ll read up. It’s possible they are some superset of Unicode or also possible that Unicode is the wrong term, and a third possibility is that the Symbol column is a mixture of Unicode and something else. — stevec, Mar 19 '20 at 02:04
All of those symbols are valid unicode. So as you might have gathered from H1's answer, the result will be completely font dependent. — thc, Mar 26 '20 at 00:38
@thc thanks, how are you able to tell, is there a function in R that returns `TRUE`/`FALSE` as to whether it's valid unicode, or is there another way? — stevec, Mar 26 '20 at 00:53
You can use `utf8::utf8_valid()` but this may not distinguish between existing valid unicode and unicode that is valid but unassigned. Can you expand a little on what you're ultimately trying to achieve? — Ritchie Sacramento, Mar 26 '20 at 03:20
@stevec I looked up each character code online, but the utf8 function H1 mentioned is probably better :p — thc, Mar 26 '20 at 17:35
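
As a minimal sketch of the validity check discussed in the comments above: base R's validUTF8() is used here in place of utf8::utf8_valid() (a swap, so that no extra package is assumed), and the sample vector is a shortened, illustrative copy of the Symbol column. It only tests whether the bytes form well-formed UTF-8, so it would not be expected to separate rendered from unrendered symbols.

```r
# Shortened, illustrative copy of the Symbol column
symbols <- c("$", "\u00a2", "\u058f", "\u060b", "\u20a3")

# validUTF8() (base R) checks that each string is well-formed UTF-8.
# It says nothing about whether a font has a glyph for the code point.
valid <- validUTF8(symbols)

# A deliberately malformed byte sequence for contrast
invalid_example <- validUTF8("\xff")
```

Every element of valid should be TRUE for these symbols, which is why a validity test alone does not answer the question.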

Ritchie Sacramento · Accepted Answer · 2020-03-23T04:19:26.450

Edit:

The function glyphs_match() from the gdtools package is designed for this; however, it didn't quite return the expected result. I'm using Lucida Console as my font and obtain the following output from glyphs_match(). There seems to be one glyph that isn't rendered but for which the function returns TRUE. Perhaps other users can explain why that is the case.

df$glyph_match <- gdtools::glyphs_match(df$Symbol, fontfile = "C:\\WINDOWS\\Fonts\\lucon.TTF")
df

   Character   Symbol glyph_match
1    \\u0024        $        TRUE
2    \\u00A2        ¢        TRUE
3    \\u00A3        £        TRUE
4    \\u00A4        ¤        TRUE
5    \\u00A5        ¥        TRUE
6    \\u058F <U+058F>       FALSE
7    \\u060B <U+060B>       FALSE
8    \\u07FE <U+07FE>       FALSE
9    \\u07FF <U+07FF>       FALSE
10   \\u09F2 <U+09F2>       FALSE
11   \\u09F3 <U+09F3>       FALSE
12   \\u09FB <U+09FB>       FALSE
13   \\u0AF1 <U+0AF1>       FALSE
14   \\u0BF9 <U+0BF9>       FALSE
15   \\u0E3F <U+0E3F>       FALSE
16   \\u17DB <U+17DB>       FALSE
17   \\u20A0 <U+20A0>       FALSE
18   \\u20A1        ¢        TRUE
19   \\u20A2 <U+20A2>       FALSE
20   \\u20A3 <U+20A3>        TRUE

Earlier answer - may only work on Windows:

There will be variation depending on your font and system. For example, when I run your code, my output doesn't match what you've provided:

df <- structure(list(Character = c("\\u0024", "\\u00A2", "\\u00A3", 
                             "\\u00A4", "\\u00A5", "\\u058F", "\\u060B", "\\u07FE", "\\u07FF", 
                             "\\u09F2", "\\u09F3", "\\u09FB", "\\u0AF1", "\\u0BF9", "\\u0E3F", 
                             "\\u17DB", "\\u20A0", "\\u20A1", "\\u20A2", "\\u20A3"), 
                     Symbol = c("$", "¢", "£", "¤", "¥", "\u058f", "\u060b", "\u07fe", "\u07ff", 
                                "৲", "৳", "\u09fb", "\u0af1", "\u0bf9", "฿", "៛", "₠", 
                                "₡", "₢", "₣")), row.names = c(NA, 20L), class = "data.frame")

df
   Character   Symbol
1    \\u0024        $
2    \\u00A2        ¢
3    \\u00A3        £
4    \\u00A4        ¤
5    \\u00A5        ¥
6    \\u058F <U+058F>
7    \\u060B <U+060B>
8    \\u07FE <U+07FE>
9    \\u07FF <U+07FF>
10   \\u09F2 <U+09F2>
11   \\u09F3 <U+09F3>
12   \\u09FB <U+09FB>
13   \\u0AF1 <U+0AF1>
14   \\u0BF9 <U+0BF9>
15   \\u0E3F <U+0E3F>
16   \\u17DB <U+17DB>
17   \\u20A0 <U+20A0>
18   \\u20A1        ¢
19   \\u20A2 <U+20A2>
20   \\u20A3 <U+20A3>

But one crude way of capturing if the glyph exists is:

nchar(capture.output(cat(df$Symbol, sep = "\n"))) == 1

[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18]  TRUE FALSE FALSE

So the glyphs can be filtered by:

library(dplyr)

df %>%
  filter(nchar(capture.output(cat(Symbol, sep = "\n"))) == 1)

  Character Symbol
1   \\u0024      $
2   \\u00A2      ¢
3   \\u00A3      £
4   \\u00A4      ¤
5   \\u00A5      ¥
6   \\u20A1      ¢
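
The same check can be wrapped in a small helper so that both directions of the filter read clearly. This is only a sketch: prints_as_glyph is a hypothetical name, it evaluates one element at a time rather than the whole column at once, and, like the code above, its result depends on the console, locale and operating system, so it may well return TRUE for every row on a UTF-8 system.

```r
# Hypothetical helper (not from any package): TRUE where cat() emits the
# symbol as exactly one character, FALSE where it is escaped (e.g. <U+058F>).
prints_as_glyph <- function(x) {
  vapply(x, function(s) {
    out <- capture.output(cat(s))
    # allowNA = TRUE makes nchar() return NA instead of failing on bad input
    isTRUE(length(out) == 1L &&
             nchar(out, type = "chars", allowNA = TRUE) == 1L)
  }, logical(1), USE.NAMES = FALSE)
}

# Small illustrative subset of the data
df_small <- data.frame(Character = c("\\u0024", "\\u058F"),
                       Symbol = c("$", "\u058f"),
                       stringsAsFactors = FALSE)

rendered <- prints_as_glyph(df_small$Symbol)
df_small[rendered, ]    # rows printed as a glyph
df_small[!rendered, ]   # rows printed in escaped form
```

With dplyr, the equivalent calls would be `df %>% filter(prints_as_glyph(Symbol))` and `df %>% filter(!prints_as_glyph(Symbol))`.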

jared_mamrot · Answer 2 · 2020-03-23T01:54:13.770

Use as.character.POSIXt to 'render' symbols and pad with spaces. Unicode characters in the form "\uxxxx" will be printed as a single character and all others will be larger; then you can filter according to length:

# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 2)

# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) == 1)

If you have a longer string as a 'symbol' (e.g. "aaaaaaaaaa₣"), the padding increases and needs to be accounted for, e.g.:

# To keep 'single char' symbols e.g. "$":
df %>% filter(nchar(as.character.POSIXt(Symbol)) >= 11)

# Or for 'unicode format' symbols e.g. "\u07fe":
df %>% filter(nchar(as.character.POSIXt(Symbol)) <= 10)
