Replace unicode by its value in a dataframe

Question

I tried to replace the Unicode escape "<U+00F3>" in a data frame column using sapply and gsub, but nothing happened. The column that contains it is of type chr.

Here is the call:

tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U+00F3>", replacement= "o")

EDIT:

Following Cath's comment below, I escaped the + with \\:

tableExcel$Team <- sapply(tableExcel$Team, gsub, pattern = "<U\\+00F3>", replacement= "o")

But it still didn't work.

I also tried to provide an example of my dataset, but the replacement works on the example and not on my real data:

tableExcel <- data.frame("Team" = c("A", "B", "C", "Reducci<U+00F3>n"), "Point" = c(2, 30, 40, 30))
tableExcel$Team <- as.character(tableExcel$Team)

For more information, here is how I import my Excel file:

tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))

The structure of my data:

structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB", "AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA), `2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5), `2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)), row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame")

you need to add `fixed=TRUE` for `gsub` or escape special characters (`"\\+"` instead of `"+"`) as `+` means *"one or more times"* in regex and not just "+", see `?regex` — Cath, Jun 27 '19 at 06:40
I tried like that but it didn't work: ```tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team, fixed = TRUE)``` — Forzan, Jun 27 '19 at 06:44
either escape OR put `fixed=TRUE`: your line is searching for "<U\\+00F3>" as is — Cath, Jun 27 '19 at 06:45
then we will need a little more information... because `gsub("<U\\+00F3>", "o", "aaa<U+00F3>")` gives `[1] "aaao"` as expected. (< and > do not need to be escaped) — Cath, Jun 27 '19 at 06:49
OK, it works for me as well, but when I try it on the data frame nothing happens: ```tableExcel <- as.data.frame(read_excel("Dataset LOS.xls", sheet = "Liga Squads"))``` — Forzan, Jun 27 '19 at 06:53
As I don't have your data and thus cannot know what they look like, it is quite difficult for me to guess... Please see how to make a [mcve] — Cath, Jun 27 '19 at 06:57
`gsub("<U\\+00F3>", "o", tableExcel$Team)`: `[1] "A" "B" "C" "Reduccion"`... — Cath, Jun 27 '19 at 07:11
Oops, I forgot to put the \\. Now I don't understand, because it works here but not in my real code... and I didn't forget anything — Forzan, Jun 27 '19 at 07:16
I reopened your question as the problem seems not to be only the escaping of the "+" sign (though that was the original question). However, as it stands, we cannot reproduce the error and hence cannot help you — Cath, Jun 27 '19 at 07:49
As the problem you mentioned is not reproducible from the example you shared, it is difficult to help you. Please try to add `dput(head(tableExcel))` data from your actual data. — Ronak Shah, Jun 27 '19 at 07:54
structure(list(Team = c("CHURN", "CHURN", "RESIDENCIAL NPTB", "RESIDENCIAL NPTB", "AUDIENCIAS TV", "AUDIENCIAS TV"), Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA), `2019-S02` = c(0, 0, 10, 10, NA, NA), `2019-S03` = c(93, 88, 46, 19, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 13), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5), `2019-S06` = c(NA, NA, 66, 48, 55, 39.5), `2019-S07` = c(131, 112, 103, 63, 40.5, 38)), row.names = c(1L, 2L, 4L, 5L, 7L, 8L), class = "data.frame") — Forzan, Jun 27 '19 at 08:06
in your data.frame whose structure you gave in the comment above (and which should be in the main Q), there is no unicode bit... — Cath, Jun 27 '19 at 10:36
Please [edit] your question to include your data where it can be formatted and is more noticeable instead of in a comment — camille, Jun 27 '19 at 13:57

score 2 · Accepted Answer · answered Jun 27 '19 at 09:43

I'm unable to replicate the issue with gsub. The following works as expected:

tableExcel$Team <- gsub("<U\\+00F3>", "o", tableExcel$Team)

#### OUTPUT ####

              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1 Reducci<U+00F1>n P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4 Reducci<U+00F2>n P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reduccion P. entr       NA       NA       NA       NA       NA       NA       NA
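
As a minimal sketch of the escaping point raised in the comments (the vector `x` is made up for illustration and is not the asker's data), the three variants below show why the unescaped pattern silently does nothing, while the escaped pattern and `fixed = TRUE` give the same result:

```r
# Hypothetical sample vector, not taken from the original spreadsheet.
x <- c("A", "Reducci<U+00F3>n", NA)

# Unescaped: "+" is a quantifier, so this looks for "<U00F3>", "<UU00F3>", ...
# and never matches the literal text "<U+00F3>".
unescaped <- gsub("<U+00F3>", "o", x)

# Escaped: "\\+" matches a literal plus sign.
escaped <- gsub("<U\\+00F3>", "o", x)

# fixed = TRUE: the pattern is treated as plain text, so no escaping is needed.
literal <- gsub("<U+00F3>", "o", x, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```

Note that gsub is already vectorised and keeps NA values as NA, so wrapping it in sapply is not needed.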

However, replacing with regular expressions might not be the most efficient way to convert the Unicode characters, as every distinct character would need its own call to gsub. Instead, you might want to give stringi's stri_unescape_unicode() a try:

# install.packages("stringi") # Use if not yet installed.
library(stringi)

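# Match exactly four hex digits per token so that several tokens in one string are converted separately.
tableExcel$Team <- stri_unescape_unicode(gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", tableExcel$Team))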

#### OUTPUT ####

              Team  Points 2019-S01 2019-S02 2019-S03 2019-S04 2019-S05 2019-S06 2019-S07
1        Reducciñn P. Asig        0        0       93       56       NA       NA    131.0
2            CHURN P. entr        0        0       88       48       NA       NA    112.0
4        Reducciòn P. Asig       50       10       46        0     80.5     66.0    103.0
5 RESIDENCIAL NPTB P. entr        0       10       19        0     49.5     48.0     63.0
7    AUDIENCIAS TV P. Asig       NA       NA       NA       13     42.0     55.0     40.5
8             <NA> P. entr       NA       NA       NA       13     28.5     39.5     38.0
9        Reducción P. entr       NA       NA       NA       NA       NA       NA       NA

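gsub first rewrites each <U+XXXX> token as the escape sequence \uXXXX (written "\\\\u\\1" in the replacement string), and stri_unescape_unicode() then turns that escape into the actual character. As the output shows, this handles several different Unicode characters in one pass, which makes things much simpler.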
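
The two steps can also be run separately to inspect the intermediate result. This is only a sketch on a made-up vector (`team` is an assumed name), and it assumes every token has exactly four hex digits:

```r
library(stringi)

# Hypothetical sample vector, not the original spreadsheet column.
team <- c("Reducci<U+00F1>n", "CHURN", NA, "Reducci<U+00F3>n")

# Step 1: rewrite each <U+XXXX> token as a backslash-u escape.
# The character class avoids the greedy ".*", which would swallow
# everything between the first "<U+" and the last ">" in a string.
escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", team)

# Step 2: let stringi turn the escapes into real characters.
decoded <- stri_unescape_unicode(escaped)

print(escaped)
print(decoded)
```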

Data:

tableExcel <- structure(list(Team = c("Reducci<U+00F1>n", "CHURN", "Reducci<U+00F2>n", 
"RESIDENCIAL NPTB", "AUDIENCIAS TV", NA, "Reducci<U+00F3>n"), 
    Points = c("P. Asig", "P. entr", "P. Asig", "P. entr", "P. Asig", 
    "P. entr", "P. entr"), `2019-S01` = c(0, 0, 50, 0, NA, NA, 
    NA), `2019-S02` = c(0, 0, 10, 10, NA, NA, NA), `2019-S03` = c(93, 
    88, 46, 19, NA, NA, NA), `2019-S04` = c(56, 48, 0, 0, 13, 
    13, NA), `2019-S05` = c(NA, NA, 80.5, 49.5, 42, 28.5, NA), 
    `2019-S06` = c(NA, NA, 66, 48, 55, 39.5, NA), `2019-S07` = c(131, 
    112, 103, 63, 40.5, 38, NA)), row.names = c(1L, 2L, 4L, 5L, 
7L, 8L, 9L), class = "data.frame")
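
If neither approach changes anything on the real data, one possibility (an assumption, since the original file was never shared) is that the column holds the real character ó and R is only displaying it as <U+00F3> because the console cannot render it. A quick check along these lines, again on a made-up vector, distinguishes the two cases:

```r
# Hypothetical vector: first element holds the literal text "<U+00F3>",
# second element holds the real character.
x <- c("Reducci<U+00F3>n", "Reducci\u00f3n")

# TRUE where the eight-character literal token is stored in the string.
has_literal_token <- grepl("<U+00F3>", x, fixed = TRUE)

# TRUE where the actual Unicode character is stored.
has_real_char <- grepl("\u00f3", x, fixed = TRUE)

# The character counts differ too: the literal token adds eight characters.
n_chars <- nchar(x)

print(has_literal_token)
print(has_real_char)
print(n_chars)
```

In the second case there is no "<U+00F3>" text to match, and the pattern to replace would be "\u00f3" instead.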

Thanks for your help! I used your code, but it didn't do anything on my dataset. — Forzan, Jun 27 '19 at 11:21
OK, it seems R cloud is not working very well; it's the second time it didn't take into account what I was doing. Anyway, now it works, thanks! — Forzan, Jun 27 '19 at 11:56
