HTML's &emdash; in UTF-8 is — and in hex is seen to be 1EFBBBFE280941. How can it be encoded into a regular expression such as /[ ]/ for a compare (an "if this, then that" type of comparison)? The above would just be several ASCII characters, but how do I encode it in a regular expression so that it is treated as just one character?
Is the concept even possible?
EDIT: Still confused, but I realize now that part of my confusion was the EF BB BF, which was some sort of artifact added in my hex viewer. It appears once at the TOP of the hex list, but it's not in any of the — data bytes being examined.
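For what it's worth, EF BB BF is the UTF-8 encoding of U+FEFF, which many editors write at the start of a file as a byte order mark, so it is consistent with what the hex viewer shows. A minimal sketch of checking for it and skipping it follows; `hasUtf8Bom` and `fileBytes` are hypothetical names made up for illustration, and the byte array is only an assumed stand-in for the file contents.

```javascript
// Hypothetical helper: true if the bytes start with EF BB BF (the UTF-8 BOM).
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Assumed stand-in for the file: the marker followed by one em dash.
const fileBytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0xE2, 0x80, 0x94]);

// Skip the marker if present, then decode the remaining bytes as UTF-8.
const payload = hasUtf8Bom(fileBytes) ? fileBytes.subarray(3) : fileBytes;
const decoded = new TextDecoder("utf-8").decode(payload);

console.log(hasUtf8Bom(fileBytes));   // whether the marker was found
console.log(decoded === "\u2014");    // whether the rest is a single em dash
```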
My &emdash; came from confusing the entity with its name, Em Dash; the code is &mdash; and its data is E2 80 94. I actually use it as — in my own code, thinking it saves one more lookup.
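To connect the pieces: however the em dash is written in the HTML source, once the page is parsed the string should hold the single character U+2014, and E2 80 94 is only how that one character is stored in a UTF-8 file. The sketch below illustrates this; the variable names are made up for the example, and it assumes an environment that provides `TextEncoder` (current browsers and Node.js do).

```javascript
// One em dash, written with a Unicode escape.
const emDash = "\u2014";

// In a JavaScript string it is a single UTF-16 code unit.
const unitCount = emDash.length;

// Its code point, in hex.
const codePointHex = emDash.codePointAt(0).toString(16);

// Its UTF-8 bytes, formatted the way a hex viewer shows them.
const utf8Hex = Array.from(
  new TextEncoder().encode(emDash),
  (b) => b.toString(16).toUpperCase().padStart(2, "0")
).join(" ");

console.log(unitCount, codePointHex, utf8Hex);
```

So a regular expression only needs to match one character, not three bytes.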
As to providing code, I guess that would be this:
HTML:
<div id="a12"> — — — — — -40.83 337.01 147.96 -31.27 -82.16 47.42 -ABC- 1 — 2 — 3 — 4</div>
The first part is a copy/paste of the 11 years of Total Return entries at https://www.morningstar.com/stocks/xnas/roku/price-fair-value
The 2nd part is the 3 HTML equivalents of the — character.
The goal is to remove all — (to be replaced by ""; there are other blanks and tabs in there with it).
My JavaScript testing:
var x = document.getElementById("a12");
var m = x.textContent;
var rgex = /u\2014/g;
//var rgex = /[u\2014]/g;
//var rgex = /\u{2014}\u/g ;
var n = rgex.test(m); //but test is false
m = m.replace(rgex, "");
alert( m + " " + n); //and string is unchanged
The /u\2014/g is my attempt based on what I read, but it is not working (nor are my other attempts).
So my question is, what should it be to detect —?
EDIT: FWIW, I THINK I GOT IT!
Looks like var rgex = /\u2014/g; does it well. I had tried that before, but must have had a syntax issue.
And var rgex = /[\t \u2014]/g; also removes the tabs and spaces (all replaced with a space as a separator).
https://unicode.org/reports/tr18/ was the help for me; section 1.1 makes it pretty clear.
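For anyone landing here later, a small sketch of the difference between the failed and working patterns. The sample string is only an assumed stand-in for the div's textContent, and `stripEmDashes` is a hypothetical helper name. It also uses `+` in the character class so that each run of dashes, spaces and tabs collapses to one space, which is a small variation on the pattern above, not the same thing.

```javascript
// Assumed stand-in for x.textContent: em dashes mixed with spaces and a tab.
const sample = "\u2014 \u2014\t-40.83 337.01 -ABC- 1 \u2014 2";

// Backslash before the u: a Unicode escape for the em dash. This matches.
const worksPlain = /\u2014/.test(sample);

// The braces form is only a Unicode escape when the u flag is set.
const worksWithFlag = /\u{2014}/u.test(sample);

// Backslash after the u: the u is now a literal letter, so no em dash is found.
const misplaced = /u\2014/.test(sample);

// Hypothetical helper: collapse each run of tabs, spaces and em dashes
// into a single space, then trim the ends.
function stripEmDashes(text) {
  return text.replace(/[\t \u2014]+/g, " ").trim();
}

console.log(worksPlain, worksWithFlag, misplaced);
console.log(stripEmDashes(sample));
```

The ordinary hyphen in the negative numbers is a different character (U+002D), so it is left alone.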