4

I'm struggling to figure out a reasonable solution to this. I need to replace the following characters: ⁰¹²³⁴⁵⁶⁷⁸⁹ using a regex replace. I would think that you would just do this:

item = item.replace(/[⁰¹²³⁴⁵⁶⁷⁸⁹]/g, '');

However, when I try that, Notepad++ converts symbols 5-9 into regular digits. I realize this probably relates to the file encoding I am using, which I see is set to ANSI.

I've never really understood the difference between the various encoding formats. But I'm wondering if there is any easy fix for this issue?

Richard Hamilton
COMisHARD
  • Have you tried setting notepad++ encoding to utf8? – Andy Ray Mar 13 '16 at 22:58
  • 3
    ^ which you should **always** be using, for everything – adeneo Mar 13 '16 at 22:58
  • Also, you have to wrap that up `/[⁰¹²³⁴⁵⁶⁷⁸⁹]/g` properly, you're missing the starting bracket – adeneo Mar 13 '16 at 23:01
  • 2
    You really have to know the difference between the various character encodings. It is *essential.* This should help start your journey. http://kunststube.net/encoding/ – Jeremy J Starcher Mar 13 '16 at 23:01
  • 1
    Works just fine if you correct the regex *(and jsFiddle is using UTF8)* -> **https://jsfiddle.net/x010mpdp/** – adeneo Mar 13 '16 at 23:15
  • You could try ECMAScript 2015 [*unicode escape sequences*](https://mathiasbynens.be/notes/es6-unicode-regex), but support might be lacking… – RobG Mar 13 '16 at 23:22

3 Answers

6

Here is a simple regex for finding superscript numbers:

/\p{No}/gu

Breakdown:

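  • \p{No} matches any character in the Unicode Other_Number category: superscript and subscript digits, but also fractions, circled numbers, and other numeric characters that are not decimal digits [0-9]
  • u modifier: unicode. The pattern is treated as a sequence of Unicode code points rather than UTF-16 code units. It is also required for \p{...} and \u{...} escapes to be recognized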
  • g modifier: global. All matches (don't return on first match)
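To see how broad `\p{No}` is in practice, here is a small sketch. It assumes an engine that implements Unicode property escapes (standardized in ECMAScript 2018, after this answer was written). The sample string is made up for illustration.

```javascript
// Assumes an engine with Unicode property escape support (ES2018+).
const otherNumber = /\p{No}/gu;

// Hypothetical sample: superscript two (U+00B2), vulgar fraction one half
// (U+00BD), circled digit one (U+2460), and an ordinary ASCII digit.
const sample = "x\u00b2 + \u00bd + \u2460 + 7";

// Every Other_Number character is removed, not just the superscript.
// The ASCII digit 7 is a decimal digit (Nd), so it is kept.
const stripped = sample.replace(otherNumber, "");
console.log(JSON.stringify(stripped));
```

If only the ten superscript digits should be removed, an explicit character class is the safer choice.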

https://regex101.com/r/zA8sJ4/1

At the time of writing, most browsers have no built-in support for Unicode property escapes such as \p{No} in regular expressions. I would recommend using the XRegExp library:

XRegExp provides augmented (and extensible) JavaScript regular expressions. You get new modern syntax and flags beyond what browsers support natively. XRegExp is also a regex utility belt with tools to make your client-side grepping and parsing easier, while freeing you from worrying about pesky aspects of JavaScript regexes like cross-browser inconsistencies or manually manipulating lastIndex.

http://xregexp.com/

HTML Solution

HTML has a <sup> tag for representing superscript text.

The <sup> tag defines superscript text. Superscript text appears half a character above the normal line, and is sometimes rendered in a smaller font. Superscript text can be used for footnotes, like WWW[1].

If the page contains superscript numbers, the HTML markup most likely uses the <sup> tag for them.

var math = document.getElementById("math");

// \d+ matches one or more digits, so multi-digit exponents are removed too
math.innerHTML = math.innerHTML.replace(/<sup>\d+<\/sup>/g, "");
<p id="math">4<sup>2</sup>+ 3<sup>2</sup></p>
Richard Hamilton
  • I don't think that's a valid regex in javascript, the unicode flag is not supported – adeneo Mar 13 '16 at 23:19
  • @adeneo—Unicode escape sequences (and the u flag) are supported in ECMAScript 2015, however not many browsers seem to have implemented them yet. – RobG Mar 13 '16 at 23:25
  • `\p{No}` also matches around [600 other characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Other_Number:]) that aren't superscript numbers. – 一二三 Mar 13 '16 at 23:28
  • @RobG - indeed, didn't know that. I can find it in the spec, but not much about browser support, seems it's not really supported anywhere yet. However, the OP's regex works just fine. – adeneo Mar 13 '16 at 23:29
  • 1
    [You can't parse (X)HTML with regex.](http://stackoverflow.com/a/1732454/1529630) – Oriol Mar 13 '16 at 23:37
  • your regex is not supported by javascript. – Saleem Mar 13 '16 at 23:48
  • Even using `\p{No}` the other number property, in Unicode 11 that matches 807 code points of which the 10 superscript code points are a subset. So, you wouldn't use this to find superscript, it matches too much. –  Dec 07 '18 at 18:53
3

Use UTF-8. If for some reason you can't, a workaround is to write the characters as Unicode escapes:

var rg = new RegExp(
  "[\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079]",
  "g"
);
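As a rough usage sketch of the escaped character class: because the source contains only ASCII, the file's encoding no longer matters. The helper name `stripSuperscriptDigits` and the sample strings are hypothetical, chosen only for illustration.

```javascript
// Superscript digits 0-9 written as escapes. Note that superscript 1, 2
// and 3 sit in the Latin-1 block (U+00B9, U+00B2, U+00B3), while the rest
// are in U+2070 and U+2074..U+2079.
const superscriptDigits = /[\u2070\u00b9\u00b2\u00b3\u2074-\u2079]/g;

// Hypothetical helper name, for illustration only.
function stripSuperscriptDigits(text) {
  return text.replace(superscriptDigits, "");
}

console.log(stripSuperscriptDigits("E = mc\u00b2"));       // "E = mc"
console.log(stripSuperscriptDigits("note\u2075 and 5"));  // "note and 5"
```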
Oriol
2

I'd suggest trying the following regex:

/[\u2070-\u209f\u00b0-\u00be]+/g

Code will look like

var re = /[\u2070-\u209f\u00b0-\u00be]+/g;
var str = '⁰¹²³⁴⁵⁶⁷⁸⁹';
var subst = '';

// Pass the replacement variable declared above (subst)
var result = str.replace(re, subst);

After a successful run, result will be the empty string, since every character in str is matched and removed.

See demo here
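One caveat worth checking before using the range-based pattern: the ranges also cover characters that are not superscript digits, such as the degree sign (U+00B0) and the plus-minus sign (U+00B1). The sketch below compares it with an explicit class; the variable names and the sample string are assumptions made for illustration.

```javascript
// The range-based pattern from this answer.
const broadPattern = /[\u2070-\u209f\u00b0-\u00be]+/g;
// An explicit class limited to the ten superscript digits.
const narrowPattern = /[\u2070\u00b9\u00b2\u00b3\u2074-\u2079]+/g;

// Hypothetical sample containing a degree sign, a plus-minus sign
// and a superscript two.
const reading = "25\u00b0C \u00b11, x\u00b2";

const broadResult = reading.replace(broadPattern, "");
const narrowResult = reading.replace(narrowPattern, "");

console.log(broadResult);  // degree and plus-minus signs are removed as well
console.log(narrowResult); // only the superscript two is removed
```

Whether the wider match is acceptable depends on the input. For text that may contain units or symbols, the explicit class is the more conservative option.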

Saleem