How to find private-use UTF-8 characters in a text?

Question

From the UTF-8 encoding table and Unicode characters, I use the Supplementary Private Use Area because it contains single characters that I am sure won't be used in any text. Now I need to find them in a text. Here's a basic example:

\u{f0001} hahrehr \u{f0002} eryteryte \u{f0003}\n yfukguk\u{f0004}\nggikggk

You can see that \u{f...} are my special chars. If we console.log this text:

console.log("\u{f0001} hahrehr \u{f0002} eryteryte \u{f0003}\n yfukguk\u{f0004}\nggikggk</");

Now I need something to find all of those special chars. I thought of a regexp, but I don't know how to handle the fact that \u{f...} is interpreted differently.

I know that my problem is not very clear, but I'll take any idea that can help me.

So you need to capture all the characters "\u{..}" with this format? – namar sood Jun 30 '20 at 16:01
`/\udb80\udc01/` will find it, because it's actually two UTF-16 code units (a surrogate pair). See https://mathiasbynens.be/notes/javascript-unicode and https://mathiasbynens.be/notes/es6-unicode-regex – Bergi Jun 30 '20 at 16:34

Wiktor Stribiżew · Accepted Answer · 2020-06-30T17:42:38.137


There are three private use areas:

One in the Basic Multilingual Plane, \uE000-\uF8FF,
Plane 15, \u{F0000}-\u{FFFFD}, and
Plane 16, \u{100000}-\u{10FFFD}.

You may use

/[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/gu

to match all occurrences of these characters with an ES6-compliant regex.

See "Regex modifier /u in JavaScript?" to learn more about the u modifier. Here, it is required to support the \u{XXXXX} notation.
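A minimal sketch of why the u flag matters, assuming a JavaScript engine with ES2015 regex support. The names `ch` and `withU` are illustrative only. A supplementary-plane character is stored as two UTF-16 code units, and only a u-flagged regex treats the pair as a single code point:

```javascript
// U+F0001 is stored as a surrogate pair: two UTF-16 code units, one code point.
const ch = "\u{f0001}";
console.log(ch.length);             // 2
console.log(ch === "\udb80\udc01"); // true
console.log(Array.from(ch).length); // 1

// With the u flag, the character class matches whole code points.
const withU = /[\u{F0000}-\u{FFFFD}]/u;
console.log(withU.test(ch));        // true

// With the u flag, "." consumes the whole pair; without it, only one code unit.
console.log(/^.$/u.test(ch));       // true
console.log(/^.$/.test(ch));        // false
```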

The ES5-compliant pattern, which spells out the surrogate pairs explicitly, is

/(?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])/g
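As a quick sanity check, the two patterns can be compared on the sample string from the question. This is only a sketch: the variable names are made up for illustration, and it tests a handful of characters rather than the full ranges.

```javascript
const sample = "\u{f0001} hahrehr \u{f0002} eryteryte \u{f0003}\n yfukguk\u{f0004}\nggikggk";
const es6 = /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/gu;
const es5 = /(?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])/g;

const es6Matches = sample.match(es6);
const es5Matches = sample.match(es5);

// Both patterns find the same four private-use characters.
console.log(es6Matches.length, es5Matches.length);            // 4 4
console.log(es6Matches.every((m, i) => m === es5Matches[i])); // true

// U+FFFFE is a noncharacter just past the end of the Plane 15 range; neither pattern matches it.
console.log("\u{FFFFE}".match(es6), "\u{FFFFE}".match(es5));  // null null
```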

To get an array of hex codes for the matched code points, use some additional JavaScript code:

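const str = "\u{f0001} hahrehr \u{f0002} eryteryte \u{f0003}\n yfukguk\u{f0004}\nggikggk</";
// The u flag is required for the \u{...} code point escapes.
const regex = /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/gu;
console.log(
  // Each match is one private-use character; convert it to its hex code point,
  // left-padded with zeros to at least 4 digits. The result is an array of one-element arrays.
  str.match(regex).map(x => Array.from(x)
    .map((v) => v.codePointAt(0).toString(16))
    .map((hex) => "0000".substring(0, 4 - hex.length) + hex))
);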
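Since the u flag makes each match exactly one code point, a flat array can also be produced directly. The following is a sketch of that variant, not the answer's own code: `text`, `pua`, and `hexCodes` are illustrative names, and `padStart` assumes an ES2017 runtime.

```javascript
const text = "\u{f0001} hahrehr \u{f0002} eryteryte \u{f0003}\n yfukguk\u{f0004}\nggikggk";
const pua = /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/gu;

// match() returns null when nothing matches, so fall back to an empty array.
const hexCodes = (text.match(pua) || []).map(
  (ch) => ch.codePointAt(0).toString(16).padStart(4, "0")
);
console.log(hexCodes); // [ 'f0001', 'f0002', 'f0003', 'f0004' ]
```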

edited Jun 30 '20 at 17:42

answered Jun 30 '20 at 16:29

Wiktor Stribiżew


You might want to point out how the `u` flag is essential. – Bergi Jun 30 '20 at 16:36
Should .match(/[{0001}-{ffff}]/ug); work to get only the number of each \u{f...} char? – Esposito Jun 30 '20 at 17:21
@Esposito No, that is not how you do it. If you need to get the hex codes, you will need to process the matches with some additional logic. – Wiktor Stribiżew Jun 30 '20 at 17:23
Yes, I know. I tried something like this: ```var match = text.match(/[\u{10000}-\u{fffff}]/ug);``` ```var match2 = match[0].match(/[{10000}-{fffff}]/ug);``` but once again it's not working, because it tries to match against the private char and not the STRING – Esposito Jun 30 '20 at 17:30
@Esposito You're looking for `match.map(m => m.codePointAt(0))` – Bergi Jun 30 '20 at 17:35
I thought I could adapt your code to my use, but apparently I'm not good enough at it. I need to replace each match with its value (which is stored in a map). I have tested several things, including this one (next comment), but it does not work. I have been on this for an hour and probably lack the perspective to find the solution. – Esposito Jul 01 '20 at 15:26
`var map = new Map(); map.set("󰀁", 'something1'); map.set("󰀂", 'something2'); const str = "\u{f0001} lorem ipsum \u{f0002} dolor sit amet \u{f0003}\n consectetur adipiscing elit\u{f0004}\sed do eiusmod tempor"; var tab = str.match(regex).map(x => Array.from(x) .map((v) => v.codePointAt(0).toString(10)) .map((hex) => "0000".substring(0, 4 - hex.length) + hex)) for(var i = 0 ; i < tab.length ; i++){ str.replace(/[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/u, map.get(""+tab[i][0]+";")); }` – Esposito Jul 01 '20 at 15:26
That code is hard to read; can I do something to fix the indentation? – Esposito Jul 01 '20 at 15:27
Update the question. – Wiktor Stribiżew Jul 01 '20 at 15:30
I just asked a new question "replace char with regexp and map - js" – Esposito Jul 01 '20 at 15:44
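For the replacement described in the comments above, one possible sketch is shown here. It assumes the map is keyed by the private-use characters themselves rather than by their hex codes. The names `replacements`, `input`, `puaRegex`, and `output` are hypothetical, and this is not necessarily the approach taken in the follow-up question.

```javascript
// Hypothetical lookup table keyed by the private-use characters.
const replacements = new Map([
  ["\u{f0001}", "something1"],
  ["\u{f0002}", "something2"],
]);
const input = "\u{f0001} lorem ipsum \u{f0002} dolor sit amet \u{f0003}";
const puaRegex = /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/gu;

// replace() does not modify the string in place; it returns a new one.
// Characters without a map entry are left unchanged.
const output = input.replace(puaRegex, (ch) =>
  replacements.has(ch) ? replacements.get(ch) : ch
);
console.log(output === "something1 lorem ipsum something2 dolor sit amet \u{f0003}"); // true
```

The callback form of `replace` with the g flag handles every match in a single pass, so no loop over the matches is needed.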
