Regex matching utf-8 pattern in R

Question

For the life of me, I can't get this to work:

I'm trying to match a hex sequence like this (or any of those starting with a \x and ending in two numbers) "\xed\xa0\xbd\xed\xb8\x89" with this regex "^\\s\\x[0-9]{2}$" but it won't work.

I'm thinking I have to start with a whitespace followed by \ and an x and then have the number range 0-9 repeated twice, or no?

Any help would be welcome!!

It is not quite clear what input you have and what your expected result should look like. Check [this demo](http://ideone.com/P6zvyR). Please update the question with details if it is not what you are looking for. — Wiktor Stribiżew, Apr 30 '17 at 16:57
Sorry, I just got started with regex and thought that'd be enough context... — Primesty, May 01 '17 at 15:37
So, what do you need to get in the end? Do the answers below answer your question or is my approach working for you? — Wiktor Stribiżew, May 01 '17 at 15:52
Unfortunately not. But this one worked: `gsub("[^A-Za-z0-9 ]", "", "I mean totally \xed\xa0\xbd\xed\xb8\x8a")` and it produces I mean totally. — Primesty, May 04 '17 at 00:20

score 1 · Accepted Answer · answered May 04 '17 at 06:23

You may remove any run of one or more characters outside the printable ASCII range with the `[^ -~]+` regex:

> gsub("[^ -~]+", "", "I mean totally \xed\xa0\xbd\xed\xb8\x8a")
[1] "I mean totally "

See an online R demo.

The pattern means:

`[^` - start of a negated character class
` -~` - a range of chars in the ASCII table from a space (decimal code 32) to a tilde (decimal code 126)
`]` - end of the character class
`+` - a quantifier, matching the subpattern to the left of it one or more times.
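
To see the pattern breakdown above in action, here is a minimal sketch. The `perl = TRUE` and `useBytes = TRUE` arguments are my own addition, not part of the answer: they make `gsub()` compare byte by byte, so the call should not depend on whether the stray bytes form valid text in the current locale.

```r
# Sample input: 15 ASCII characters followed by six raw non-ASCII bytes.
x <- "I mean totally \xed\xa0\xbd\xed\xb8\x8a"

# Assumption (not in the answer above): match byte-wise so that the
# invalid multibyte sequence cannot trip up the regex engine.
cleaned <- gsub("[^ -~]+", "", x, perl = TRUE, useBytes = TRUE)

print(cleaned)
print(nchar(x, type = "bytes"))        # byte length before cleaning
print(nchar(cleaned, type = "bytes"))  # byte length after cleaning
```

Note that the class keeps only codes 32 to 126, so tabs and newlines are removed as well; add them to the class (for example `[^ -~\t\n]+`) if they should survive.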

score 0 · Answer 2 · answered Apr 30 '17 at 17:03

I do not see any reason for the whitespace, and what about the letters a-f? Also, why are you insisting that these should only occur at the beginning of the line? Try `\\x[0-9a-f]{2}`.
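
A minimal sketch of how that pattern could be written as an R string, assuming the input really holds the literal characters backslash, x and two hex digits (the question does not say so). Under that assumption the backslash has to be doubled twice: once for R's parser and once for the regex engine. The variable names are only for illustration.

```r
# Literal text: each "\\" in R source is ONE backslash in the string,
# so this is six groups of four ordinary characters.
txt <- "\\xed\\xa0\\xbd\\xed\\xb8\\x89"

# Four backslashes in R source -> two in the regex -> one literal backslash.
pat <- "\\\\x[0-9a-f]{2}"

# Collect every match of the pattern in txt.
hits <- regmatches(txt, gregexpr(pat, txt))[[1]]

print(nchar(txt))
print(hits)
```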

answered Apr 30 '17 at 17:03

G5W

score 0 · Answer 3 · answered Apr 30 '17 at 17:41

First, you may have some problems because when you assign a string literal like "\x30", R treats \x30 as a hex escape and stores the single character with that code, not the four characters \, x, 3 and 0. For example, 0x30 is the ASCII code of the digit zero:

> c = "\x30"
> c
[1] "0"

So it depends on how your string is represented and how it was assigned/read.
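
A short sketch of the difference between a hex escape and literal backslash text, which is why a regex looking for a backslash finds nothing in the string from the question. The variable names are only for illustration.

```r
escaped <- "\x30"    # hex escape: one character, the digit 0
literal <- "\\x30"   # escaped backslash: the four characters \ x 3 0

print(nchar(escaped))
print(nchar(literal))
print(escaped == "0")

# The first three bytes of the sequence from the question, as raw values.
raw_bytes <- charToRaw("\xed\xa0\xbd")
print(raw_bytes)

# 0x5c is the backslash; it does not occur among those bytes.
has_backslash <- as.raw(0x5c) %in% raw_bytes
print(has_backslash)
```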

For the regex - here is something close to what you would need, demonstrated here with forward slash, not a backslash.

library(stringr)  # str_extract() comes from the stringr package
str_extract("/xed/xa0/xbd/xed/xb8/x89", "(\\/x[0-9a-f]{2})+")

[1] "/xed/xa0/xbd/xed/xb8/x89"

This is the regex from G5W's answer above, wrapped in ( )+ so that it matches a whole sequence of such groups rather than a single one.
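
For anyone without stringr, here is a base R sketch of the same idea, again on the forward-slash stand-in. It contrasts the single-group pattern with the grouped-and-repeated one; the sample sentence and variable names are made up for illustration.

```r
s <- "see /xed/xa0/xbd/xed/xb8/x89 here"

single   <- "/x[0-9a-f]{2}"     # one group only
sequence <- "(/x[0-9a-f]{2})+"  # one or more groups in a row

# regexpr() finds the first match; regmatches() extracts it.
first_single   <- regmatches(s, regexpr(single, s))
first_sequence <- regmatches(s, regexpr(sequence, s))

print(first_single)
print(first_sequence)
```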

Thank you, that helps a lot! — Primesty, May 01 '17 at 15:37

score 0 · Answer 4 · edited May 23 '17 at 12:18

Thanks to everyone who tried to help me!! This mostly seems to be an encoding problem...

This is what ultimately worked...

gsub("[^A-Za-z0-9 ]", "", "I mean totally \xed\xa0\xbd\xed\xb8\x8a")

which produced

"I mean totally " because it removes everything except ASCII letters, digits and spaces...

I found it at how to replace single backslash in R and just had to add a space to the character class so that spaces were not deleted!
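
One thing to watch with this character class: it also strips punctuation. A small sketch comparing it with the printable-ASCII class from the accepted answer; the sample string is made up, and `perl = TRUE, useBytes = TRUE` are my additions so the call works byte by byte regardless of locale.

```r
# Hypothetical sample: punctuation plus the stray bytes from the question.
x <- "Wow, totally! \xed\xa0\xbd\xed\xb8\x8a"

# Keep only ASCII letters, digits and spaces (punctuation is dropped too).
keep_alnum <- gsub("[^A-Za-z0-9 ]", "", x, perl = TRUE, useBytes = TRUE)

# Keep every printable ASCII character (punctuation survives).
keep_print <- gsub("[^ -~]+", "", x, perl = TRUE, useBytes = TRUE)

print(keep_alnum)
print(keep_print)
```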

answered May 04 '17 at 00:23

Primesty

You just asked an unclear question, without stating if you posted a string literal or a literal string as an example. It is not a good idea to answer an unclear question yourself, because only you could do it. You may add this solution to the question itself, so that the question is clear. See my answer for an alternative solution. – Wiktor Stribiżew May 04 '17 at 06:25
Ah, I see! Thanks for the feedback! – Primesty May 04 '17 at 16:12
