HTML's &emdash; in UTF-8 is — and its hex dump appears as 1EFBBBFE280941. How can it be encoded in a regular expression such as /[ ]/ for a comparison (an "if this, then that" type of test)? Written out, the above would just be several ASCII characters, so how do I encode it in a regular expression as a single character?

Is the concept even possible?

EDIT: Still confused, but I realize now that some of my confusion was that the EF BB BF was some sort of artifact added in my hex viewer. It is at the TOP of the hex list once, but it's not in any of the — data bytes being examined.

My &emdash; came from confusing the entity with its name, Em Dash; the character is — and its UTF-8 bytes are E2 80 94. I actually write it as — in my own code, thinking it saves one more lookup.
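
To double-check the byte values mentioned above, here is a minimal sketch, assuming an environment where `TextEncoder` is available (modern browsers and Node.js). The helper name `utf8Hex` is made up for illustration. It shows the em dash alone, and the em dash preceded by a byte order mark like the one my hex viewer displayed:

```javascript
// Encode a string to UTF-8 and format the bytes as space-separated hex.
// (utf8Hex is a hypothetical helper name, not part of any library.)
function utf8Hex(text) {
  const bytes = Array.from(new TextEncoder().encode(text));
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

const emDashHex = utf8Hex("\u2014");        // em dash alone
const withBomHex = utf8Hex("\uFEFF\u2014"); // byte order mark followed by an em dash

console.log(emDashHex);  // E2 80 94
console.log(withBomHex); // EF BB BF E2 80 94
```

In the JavaScript string itself the em dash is a single character (`"\u2014".length` is 1); the three bytes only exist in the UTF-8 encoded form.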

As to providing code, I guess that would be this:

HTML:  
<div id="a12"> —    —   —   —   —   -40.83  337.01  147.96  -31.27  -82.16  47.42 -ABC- 1 &mdash; 2 &#x2014; 3 &#8212; 4</div>

The first part is a copy/paste of the 11 years of Total Return entries
 at https://www.morningstar.com/stocks/xnas/roku/price-fair-value
The second part is the three HTML equivalents of the &mdash; character.
The goal is to remove every &mdash;
 (replacing it with ""; there are also spaces and tabs alongside it).


My JavaScript test:
var x = document.getElementById("a12");
var m = x.textContent;
var rgex =  /u\2014/g;
//var   rgex = /[u\2014]/g;
//var   rgex = /\u{2014}\u/g ;

var n = rgex.test(m);       //but test is false
m = m.replace(rgex, "");        
alert( m + "      " + n);   //and string is unchanged

The /u\2014/g is my attempt based on what I read, but it does not work (nor do my other attempts).

So my question is, what should it be to detect &mdash; ?

EDIT: FWIW, I THINK I GOT IT!

Looks like var rgex = /\u2014/g; does it well. I had tried that before, but must have had a syntax issue.

And var rgex = /[\t \u2014]/g; also removes the tabs and spaces (each replaced with a space as a separator).

https://unicode.org/reports/tr18/ was what helped me; section 1.1 makes it pretty clear.
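
The two working patterns can be sketched as follows. This is a minimal sketch, not the exact page code: the `sample` string is a made-up stand-in for the div's `textContent`, and the `+` quantifier and `trim()` are my own additions so that each run of tabs, spaces and em dashes collapses to a single separator space.

```javascript
// Sample text standing in for x.textContent (assumed for illustration).
const sample = "\u2014 \t\u2014 \t-40.83 \t337.01";

// Remove only the em dashes (U+2014); spaces and tabs are left alone.
const noDashes = sample.replace(/\u2014/g, "");

// Collapse each run of tabs, spaces and em dashes into one space,
// then trim the leading/trailing separator.
const collapsed = sample.replace(/[\t \u2014]+/g, " ").trim();

console.log(JSON.stringify(noDashes));  // " \t \t-40.83 \t337.01"
console.log(JSON.stringify(collapsed)); // "-40.83 337.01"
```

Without the `+`, every matched character would be replaced individually, leaving several spaces in a row between the numbers.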

WayneF
  • Does this answer your question? [How can I use Unicode-aware regular expressions in JavaScript?](https://stackoverflow.com/questions/280712/how-can-i-use-unicode-aware-regular-expressions-in-javascript) – Brian61354270 Mar 27 '23 at 00:59
  • Obligatory background reading: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). Do note that UTF-8 is _an_ encoding for Unicode codepoints. U+2014 is the _Unicode codepoint_ for `—`. In UTF-8, it's encoded by the three-byte sequence `E2 80 94` – Brian61354270 Mar 27 '23 at 01:01
  • The codepoint for `—` in hex is `2014` - where does `1EFBBBFE280941` come from? – traktor Mar 27 '23 at 03:01
  • @Brian61354270 Thanks, but nothing is clear yet. I thought one page said /\u2014/g, but that doesn't work. It's going to take a lot of reading to sort that out. I just want to change the em dash to a blank (hex 20). This is the Morningstar Total Return % data; the early years, before the stock went public, used to be blank, containing only space and tab (20 and 09). But this weekend it changed (new, and perhaps still in progress), and now I see space and tab with E2 80 94 between them, which displays as an em dash. But there is no actual em dash as I know it. A sample to look at is https://www.morningstar.com/stocks/xnas/roku/price-fair-value – WayneF Mar 27 '23 at 04:31
  • @traktor I am quite confused by it all; it's a new subject for me. I did a copy/paste of some em dashes from a web page and converted them to hex. I can't find that page now, but at that time I saw it as 1EFBBBFE280941. I figured out it must be 1 EF BB BF E2 80 94 1. I don't know what the 1's are, terminators maybe. But there was no x2014 type character. There were five em dashes visible, and four of these 1EFBBBFE280941 with space and tab (20 and 09) characters between them. Now it is space and tab as before, but with E2 80 94 between them. – WayneF Mar 27 '23 at 04:32
  • `EFBBBF` and `E28094` are UTF-8 byte sequences for `` (U+FEFF, *Zero Width No-Break Space/BYTE ORDER MARK*) and `—` (U+2014, *Em Dash*). Unclear what the 1's are; please [edit] your question to provide a [mcve]. BTW, `&emdash;` is nothing - maybe you mean `&mdash;`? – JosefZ Mar 27 '23 at 09:53
  • Not sure how you got to `/\u{2014}\u/g` and `/[/u\2014]/g`. It's `/\u{2014}/ug` or `/\u2014/g`. Possibly `/[\u{2014}]/ug` or `/[\u2014]/g`, but if you want to match only a single character, there's no point in using a character class. – Bergi Mar 27 '23 at 20:01
  • Your edit still uses the wrong regular expressions. Please look exactly where the backslashes and slashes go. – Bergi Mar 27 '23 at 20:23
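
The patterns discussed in these comments can be checked side by side. A minimal sketch, assuming a plain string that holds a single em dash (the variable and property names are illustrative only):

```javascript
const emDash = "\u2014"; // one em dash, U+2014

const results = {
  classicEscape: /\u2014/.test(emDash),    // four-hex-digit escape, no flag needed
  braceWithFlag: /\u{2014}/u.test(emDash), // brace form, with the u flag
  braceNoFlag: /\u{2014}/.test(emDash),    // brace form without u: not a code point escape
  misplacedSlash: /u\2014/.test(emDash),   // backslash after the u: not a Unicode escape
};

console.log(results);
```

Only the first two report `true` here. Without the `u` flag, `\u{2014}` is not read as a code point escape, and in `/u\2014/` the backslash sits after the `u`, so neither pattern matches the em dash.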

0 Answers