Removing non-breaking space characters in R

Question

I have a data frame, df1, with several columns and more than 50,000 observations. One of the variables is PLATES (denoted here as "y"), which contains the plate numbers of buses in a city. I want to match this data frame against another one, df2, which also has plate data, and keep only the matching records. While looking at the data in df1, which comes from a CSV file, I realized that several observations of y have symbols before the plate number that correspond to a non-breaking space. How do I get rid of them so that they don't cause problems when I do the matching? Here's some code to help illustrate. Let's say you have 5 plate numbers:

y <- c("0740170", "0740111", "0740119", "0740115", "0740048")

But upon further inspection

view(y)

You see the following

<c2><a0>0740170
<c2><a0>0740111
<c2><a0>0740119
<c2><a0>0740115
<c2><a0>0740048

I tried this, from this post https://blog.tonytsai.name/blog/2017-12-04-detecting-non-breaking-space-in-r/, but it didn't work:

y <- gsub("\u00A0", " ", y, fixed = TRUE)

I would greatly appreciate your help on how to deal with this issue. Thanks!

What do you get in the console if you type `charToRaw(y[1])`? – Allan Cameron, Jun 09 '20 at 17:48
Thanks for replying, Allan. I get the following message: "Error in charToRaw(y_2[1]) : argument must be a character vector of length 1" – Ricardo, Jun 09 '20 at 17:55
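
A note on the error above: `charToRaw()` only accepts a single character string, so the message suggests that `y` may not be a character vector (a factor, for instance, although that is only a guess). A minimal sketch of the byte-level check, using a made-up plate value and converting with `as.character()` first:

```r
# Hypothetical plate with a leading non-breaking space (U+00A0),
# stored as a factor to mimic a column that is not a character vector.
y <- factor("\u00a00740170")

# as.character() guards against factors; charToRaw() then shows the raw bytes.
first_plate <- as.character(y[1])
bytes <- charToRaw(first_plate)
print(bytes)

# In UTF-8 a non-breaking space is the byte pair c2 a0.
has_leading_nbsp <- startsWith(first_plate, "\u00a0")
print(has_leading_nbsp)
```

If the first two bytes are `c2 a0`, the value starts with a UTF-8 encoded non-breaking space, which matches the `<c2><a0>` shown in the question.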

score 1 · Accepted Answer · answered Jun 09 '20 at 19:38


I'm not sure this will help, as I can't test my answer (I can't recreate your problem). But if the non-breaking space characters are also non-ASCII characters, then this should work:

y <- gsub("[^ -~]+", "", y)

The pattern matches any run of characters outside the printable ASCII range (space through tilde), and the replacement deletes them by substituting the empty string. Hope this helps.
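
As a rough illustration of how this fits the matching step described in the question, here is a small sketch. The data frames, the `route` and `depot` columns, and all values are invented for the example; only the `PLATES` name and the `gsub()` pattern come from the thread.

```r
# Hypothetical data: plates in df1 carry a leading non-breaking space (U+00A0).
df1 <- data.frame(
  PLATES = c("\u00a00740170", "\u00a00740111", "\u00a00740119"),
  route  = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  PLATES = c("0740170", "0740119", "0740999"),
  depot  = c("North", "South", "East"),
  stringsAsFactors = FALSE
)

# Drop every character outside the printable ASCII range (space to tilde).
df1$PLATES <- gsub("[^ -~]+", "", df1$PLATES)

# merge() keeps only rows whose PLATES value appears in both data frames.
matched <- merge(df1, df2, by = "PLATES")
print(matched)
```

Without the `gsub()` step none of the plates in `df1` would equal those in `df2`, so the merge would return no rows.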

answered Jun 09 '20 at 19:38

Chris Ruehlemann


It did! Thank you very much! – Ricardo Jun 10 '20 at 03:26

nstjhp · Answer 2 · 2023-07-04T10:52:03.897


EDIT 1 This works under R 4.0.3 and 4.1.2 on Windows, but no longer under 4.2.2 or 4.3.1.

The other answer matches any non-ASCII character, but what if you need to keep non-ASCII characters, e.g. letters with accents? In this situation I wanted to match specifically a non-breaking space of type <c2><a0>, as in the question. What worked for me was matching \xa0:

test # nbsp between type and II
# [1] "Diabète de type II"
tools::showNonASCII(test) 
# 1: Diab<c3><a8>te de type<c2><a0>II

# other answer
gsub("[^ -~]+", " ", test) # has missing è
# [1] "Diab te de type II"
tools::showNonASCII(gsub("[^ -~]+", " ", test))# no output as no non-ascii chars left

gsub("\xa0+", " ", test)
# [1] "Diabète de type II"
tools::showNonASCII(gsub("\xa0+", " ", test)) # the <c2><a0> nbsp is replaced
# 1: Diab<c3><a8>te de type II

Hat tip to http://www.pmean.com/posts/non-breaking-space/

EDIT 2 This example can be made to work on Windows and R 4.3.1 by also matching the <c2>

test = rawToChar(as.raw(c(0x44, 0x69, 0x61, 0x62, 0xc3, 0xa8, 0x74, 0x65, 0x20,  0x64, 0x65, 0x20, 0x74, 0x79, 0x70, 0x65, 0xc2, 0xa0, 0x49, 0x49)))
tools::showNonASCII(test)
# 1: Diab<c3><a8>te de type<c2><a0>II
tools::showNonASCII(gsub('\xc2\xa0+', '_', test))
# 1: Diab<c3><a8>te de type_II
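
Another option, sketched here under the assumption that the strings are valid UTF-8, is to write the non-breaking space as the Unicode escape `"\u00a0"` rather than as raw bytes. This may sidestep the version-dependent behaviour described above, though it has not been checked against every R version mentioned in this answer. The `trimws()` call relies on its `whitespace` argument, which needs R 3.6.0 or later.

```r
# Hypothetical string: accented letter plus a non-breaking space before "II".
test <- "Diab\u00e8te de type\u00a0II"

# Replace only U+00A0; the accented character is left alone.
cleaned <- gsub("\u00a0", " ", test, fixed = TRUE)
print(cleaned)

# For leading/trailing cases like the plates in the question, trimws() with
# a PCRE class for horizontal and vertical white space also covers U+00A0.
plate <- trimws("\u00a00740170", whitespace = "[\\h\\v]")
print(plate)
```

Note that for the plates in the question the replacement should be `""` rather than `" "`: swapping the non-breaking space for an ordinary space still leaves a leading space, and the plates would still fail to match.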

edited Jul 04 '23 at 10:52

answered Dec 20 '22 at 13:43

nstjhp


Interesting, I don't get the same result: `z = rawToChar(as.raw(as.hexmode(c('48','69','c2','a0','48','69')))); tools::showNonASCII(z); gsub('\xa0+', '_', z)` outputs `Hi<c2><a0>Hi` and `"Hi Hi"` – Bug Catcher Nakata Jun 29 '23 at 07:35

@BugCatcherNakata Yes I confirm what you get on R 4.2.2, worse on 4.3.1 I get an error! `Error in gsub("\xa0+", "_", z) : 'pattern' is invalid In addition: Warning message: In gsub("\xa0+", "_", z) : unable to translate '+' to a wide string` Unfortunately I don't remember which version of R I used when I made this post, 4.0 or 4.1 I guess. So presumably something has changed in R recently? – nstjhp Jun 29 '23 at 15:21
Damn. These non-breaking spaces have been causing me no end of trouble! – Bug Catcher Nakata Jul 02 '23 at 23:41
@BugCatcherNakata I have tested it on R 4.0.3 and 4.1.2 on Windows. It works! `z = rawToChar(as.raw(as.hexmode(c('48','69','c2','a0','48','69')))); tools::showNonASCII(z); gsub('\xa0+', '_', z) ; 1: Hi<c2><a0>Hi [1] "HiÂ_Hi" ` Not sure why it gives a "Â" instead of treating it as part of the NBSP, but maybe this can help locate the change. The only thing I see in [recent R news](https://cran.r-project.org/doc/manuals/r-release/NEWS.html) that could be relevant for 4.2.0 is "R uses UTF-8 as the native encoding on recent Windows systems" – nstjhp Jul 04 '23 at 09:37
@BugCatcherNakata New edit will hopefully solve the nightmare! – nstjhp Jul 04 '23 at 10:52
Thank you!!!! If you post that as an answer to my question here https://stackoverflow.com/questions/76601289/replacing-characters-in-r-string-based-on-raw-hex-values I will accept it – Bug Catcher Nakata Jul 05 '23 at 00:08
