Split String with second (single) Backslash / R Emojis (Unicode) without Modifier

Question

I have a tibble (built with tribble) with a chr column that contains the Unicode escapes of emojis. I want to split these strings into two columns when needed, i.e. when a string contains more than one backslash. So I need to split at the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.

Here is what I tried:

df <- tibble::tribble(
  ~RUser, ~REmoji,
  "User1", "\U0001f64f\U0001f3fb",
  "User2", "\U0001f64f",
  "User2", "\U0001f64f\U0001f3fc"
)

df %>% mutate(newcol = gsub("\\\\*", "", REmoji))

I found the question "Replace single backslash in R", but in my case there is only a single backslash per escape, and I don't understand how to split the column here.

The result should look like this output:

df2 <- tibble::tribble(
  ~RUser, ~REmoji1, ~newcol,
  "User1", "\U0001f64f", "\U0001f3fb",
  "User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
  "User2", "\U0001f64f", "\U0001f3fc"
)

Thanks a lot!

Maybe `df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))`? — Wiktor Stribiżew, Jun 30 '21 at 13:24

score 2 · Answer 1 · answered Jun 30 '21 at 17:32

We could also use substring from base R

df$newcol <- substring(df$REmoji, 2)
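
If both parts are wanted, as in df2 from the question, substring can be paired with substr. This is only a minimal base R sketch, assuming every string starts with exactly one base emoji code point, optionally followed by a modifier. The data frame is rebuilt with data.frame so the snippet runs on its own; the column names REmoji1 and newcol are taken from df2.

```r
# Rebuild the question's data without extra packages
df <- data.frame(
  RUser  = c("User1", "User2", "User2"),
  REmoji = c("\U0001f64f\U0001f3fb", "\U0001f64f", "\U0001f64f\U0001f3fc"),
  stringsAsFactors = FALSE
)

# substr/substring count characters (code points), not bytes
df$REmoji1 <- substr(df$REmoji, 1, 1)   # first code point: the base emoji
df$newcol  <- substring(df$REmoji, 2)   # everything after it: the modifier, or ""
```

Rows without a modifier get an empty string in newcol, because substring returns "" when the start position lies past the end of the string.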

akrun


I think they want to extract the first character not remove it: `substr(df$REmoji, 1, 1)` – GKi Jul 01 '21 at 06:13

Wiktor Stribiżew · Accepted Answer · 2021-06-30T16:53:41.143

1

Note these \U... are single Unicode code points, not just a backslash + digits/letters.
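
A quick way to check this, and to see why the gsub attempt with a backslash pattern changes nothing, is to inspect one of the strings directly. A small sketch in base R (the variable names are illustrative only):

```r
x <- "\U0001f64f\U0001f3fb"

# The string holds two characters (code points), not 20 letters and digits
n_chars <- nchar(x, type = "chars")

# The integer code points behind the two \U escapes
code_points <- utf8ToInt(x)

# There is no literal backslash in the string at all
has_backslash <- grepl("\\", x, fixed = TRUE)
```

Here n_chars is 2 and has_backslash is FALSE, so any pattern that looks for a backslash has nothing to match.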

Using the ^. PCRE regex with sub provides the expected results:

> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
  RUser REmoji                 newcol      
  <chr> <chr>                  <chr>       
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f"           ""          
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"

Make sure you pass the perl=TRUE argument.

And in order to do the reverse, i.e. keep the first code point only, you can use:

df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
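
The two patterns can also be combined into one small helper that returns both parts at once. This is only a sketch: split_first_codepoint is a hypothetical name, not a function from any package, and it assumes the first code point is the base emoji and everything after it is the modifier.

```r
# Hypothetical helper: split each string after its first code point
split_first_codepoint <- function(x) {
  data.frame(
    first = sub("^(.).+", "\\1", x, perl = TRUE),  # keep only the first code point
    rest  = sub("^.", "", x, perl = TRUE),         # drop the first code point
    stringsAsFactors = FALSE
  )
}

emoji <- c("\U0001f64f\U0001f3fb", "\U0001f64f", "\U0001f64f\U0001f3fc")
parts <- split_first_codepoint(emoji)
```

Emojis made of more than two code points (for example sequences joined with a zero-width joiner) would need a different rule, since "." only ever consumes one code point.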

edited Jun 30 '21 at 16:53

answered Jun 30 '21 at 13:30

Wiktor Stribiżew


Thank you! That correctly outputs the second part of the string in its own column. But what would be the smartest way to get only the front part. Because in the end I don't need the second part, only the front one. – Alex_ Jun 30 '21 at 13:40

@Alex_ `df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))` will keep the first code point. – Wiktor Stribiżew Jun 30 '21 at 13:41

@Alex_ Does it work as expected now? Note my output is exactly as you asked for in df2. – Wiktor Stribiżew Jun 30 '21 at 13:48
