Unable to remove foreign-language Unicode codes

Question

I have a .csv file that contains text in multiple foreign languages (Russian, Japanese, Arabic, etc.). For example, a column entry looks like this: <U+03BA><U+03BF><U+03C5>. I want to remove the rows that contain this kind of text. I tried various solutions, all of them without result:

test_fb5 <- read_csv('test_fb_data.csv', encoding = 'UTF-8')

or, applied to a column:

`gsub("[<].*[>]", "")` or `sub("^\\s*<U\\+\\w+>\\s*", "")`

or

gsub("\\s*<U\\+\\w+>$", "")

It seems that R 4.1.0 doesn't find the respective characters. I cannot find a way to attach a small chunk of the file here, so here is a capture of it:

                        address
33085                                           9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta
33086                                 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec
33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region
33088                                       390 progress ave unit 2 m1p 2z6 toronto on canada ontario
                                                     name
33085                                md legals canada inc
33086                             les aspirateurs jpg inc
33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis
33088                    wrench it up plumbing mechanical
                                                               category
33085 general practice attorneys divorce  family law attorneys notaries
33086                                                              <NA>
33087               mediterranean restaurants fish  seafood restaurants
33088            plumbing services damage restoration  mold remediation
             phone
33085  17808512828
33086  14507781003
33087 302298072134
33088  14168005050

The 3308x numbers are the row numbers of the dataset. Thank you for your time!

You can share datasets with `dput(YOUR_DATASET)` or smaller samples with `dput(head(YOUR_DATASET))`. (See [this answer](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610) for some great advice.) — ktiu, Jun 22 '21 at 16:05
@ktiu, I do not know how to use dput(). I typed it in the console with my file, 6 rows in. Nothing happens. – myhy, Jun 22 '21 at 16:23
@ktiu I typed dput(head(test_fb5)) in the console and it printed the respective rows. How do I add the rows to my question? – myhy, Jun 22 '21 at 16:30
Edit the question and paste from your console into a code block — ktiu, Jun 22 '21 at 16:31

Chris Ruehlemann · Answer 1 · 2021-06-22T16:51:07.653

0

You can use a negated character class to remove the <U...> codes:

gsub("<[^>]+>", "", x)

This matches any substring that:

starts with <,
is followed by one or more characters other than >, and
ends with >

If there are other substrings between < and > that you do not want to remove, add U to target the Unicode codes more specifically: <U[^>]+>
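As a minimal sketch, the two patterns can be compared on a few sample strings. The vector `samples` is made up for illustration (modelled on the question's data) and assumes the `<U+XXXX>` codes are stored as literal text. It shows that the generic pattern also strips a literal `<NA>`, while the narrower one leaves it alone:

```r
# Hypothetical sample strings; the codes are plain text here, not real characters
samples <- c("<U+03BA><U+03BF><U+03C5>p<U+03B9>tsa 18050 greece",
             "p<U+03AC>t<U+03C1>a<U+03BB>patralis",
             "<NA> plumbing services")

# Generic pattern: removes anything between < and >
generic <- gsub("<[^>]+>", "", samples)

# Narrower pattern: removes only substrings that start with <U
narrow <- gsub("<U[^>]+>", "", samples)

print(generic)
print(narrow)
```

Which pattern fits depends on whether the text contains other angle-bracketed substrings that should survive.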

Data:

x <- "address 33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta 33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec 33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region 33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario name 33085 md legals canada inc 33086 les aspirateurs jpg inc 33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis 33088  wrench it up plumbing mechanical category 33085 general practice attorneys divorce  family law attorneys notaries 33086                                                              <NA> 33087               mediterranean restaurants fish  seafood restaurants 33088            plumbing services damage restoration  mold remediation phone 33085  17808512828 33086  14507781003 33087 302298072134 33088  14168005050"
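The question asks to drop whole rows rather than strip the codes. A minimal sketch of that, assuming the `<U+XXXX>` sequences are stored as literal text in the columns; the data frame `df` is a hypothetical stand-in for `test_fb5`, and `code_pattern` is a stricter variant of the pattern above that I chose for illustration:

```r
# Hypothetical stand-in for the asker's data frame (not the real data)
df <- data.frame(
  address = c("390 progress ave toronto on canada",
              "<U+03BA><U+03BF><U+03C5>tsa 18050 greece"),
  name    = c("wrench it up plumbing mechanical",
              "p<U+03AC>t<U+03C1>apatralis"),
  stringsAsFactors = FALSE
)

# Assumed form of the codes: <U+ followed by hexadecimal digits and >
code_pattern <- "<U\\+[0-9A-Fa-f]+>"

# TRUE for each row in which at least one column contains a code
has_code <- Reduce(`|`, lapply(df, function(col) grepl(code_pattern, col)))

# Keep only the rows without codes
clean <- df[!has_code, , drop = FALSE]
print(clean)
```

If R is only displaying real non-ASCII characters as `<U+XXXX>` when printing, the stored text may contain no `<` at all, and this pattern would then match nothing.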

edited Jun 22 '21 at 16:51

answered Jun 22 '21 at 16:44

Chris Ruehlemann


@Chris Ruehlemann Thank you Chris! It seems to work on a single column, but when I print the data frame in the console, the entries are the same. I typed: test_fb5$name <- gsub("<[^>]+>", "", test_fb5$name). When I print test_fb5, the entries still show the Unicode codes. What did I do wrong? – myhy Jun 22 '21 at 17:04
Printing test_fb5$name gives: [1] "md legals canada inc" [2] "les aspirateurs jpg inc" [3] "πάτραληςpatralis" [4] "wrench it up plumbing mechanical". When I print test_fb5, the "name" entries look unchanged: md legals canada inc 2 les aspirateurs jpg inc 3 ptapatralis 4 wrench it up plumbing mechanical. Why is that? It seems that my changes are lost. – myhy Jun 22 '21 at 17:42
Sorry, as long as you do not provide a reproducible data set you cannot expect anybody to be able to help you efficiently. So, make the dataframe, or a part of it, available in **reproducible** format. – Chris Ruehlemann Jun 22 '21 at 19:51
