
I want to remove all superscript and subscript chars from the text.

Example: '⁰'.

I found an example on Stack Overflow, but it only handles superscript digits, not superscript letters or subscripts.

Does anyone know how to achieve this? One way would be to list every possible superscript and subscript and replace them one by one, but that is a bit impractical.

Sebastian Lenartowicz
Herbi Shtini
  • Are these chars all listed at http://www.fileformat.info/info/unicode/block/superscripts_and_subscripts/list.htm? Also, see http://unicode.org/charts/PDF/U2070.pdf – Wiktor Stribiżew Jul 18 '17 at 10:21
  • I guess there is no way but to make a long list of those chars and replace them one by one in a loop – Herbi Shtini Jul 18 '17 at 10:25
  • Why loop if you need to remove these chars? Can you please show an example string and expected output? BTW, does that list cover the chars you need? Try `.replace(/[\u2070\u2071\u2074-\u208E\u2090-\u209C]+/g, '')` – Wiktor Stribiżew Jul 18 '17 at 10:27

1 Answer


Based on the subscript and superscript Unicode range reference and a manual search for "subscript" and "superscript" in a UniView tool, you may use

.replace(/[\u00B0\u00B2\u00B3\u00B9\u02AF\u0670\u0711\u2121\u213B\u2207\u29B5\uFC5B-\uFC5D\uFC63\uFC90\uFCD9\u2070\u2071\u2074-\u208E\u2090-\u209C\u0345\u0656\u17D2\u1D62-\u1D6A\u2A27\u2C7C]+/g, '')

See the regex demo.

The + quantifier (one or more consecutive occurrences) will make it easier for the regex engine to remove whole chunks of 1+ sub/superscript chars in one go.
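The superscript digits ¹, ² and ³ sit in the Latin-1 Supplement block (U+00B9, U+00B2, U+00B3), outside the Superscripts and Subscripts block, so a class built only from U+2070 to U+209C leaves them untouched. The sketch below illustrates this with a reduced character class; the variable names and the sample string are made up for the illustration and are not part of the pattern above.

```javascript
// Class limited to the Superscripts and Subscripts block (U+2070..U+209C).
var blockOnly = /[\u2070\u2071\u2074-\u208E\u2090-\u209C]+/g;
// Same class plus the three Latin-1 superscript digits.
var withLatin1 = /[\u00B2\u00B3\u00B9\u2070\u2071\u2074-\u208E\u2090-\u209C]+/g;
// Same class again, without the + quantifier.
var withLatin1NoPlus = /[\u00B2\u00B3\u00B9\u2070\u2071\u2074-\u208E\u2090-\u209C]/g;

var sample = "x¹²³ and y⁴⁵⁶";

// Hex code points of a few superscript digits, to show which block each is in.
var codePoints = Array.from("¹²³⁴").map(function (c) {
    return c.codePointAt(0).toString(16).toUpperCase();
});
console.log(codePoints.join(" "));

var blockOnlyResult = sample.replace(blockOnly, '');
var withLatin1Result = sample.replace(withLatin1, '');
var noPlusResult = sample.replace(withLatin1NoPlus, '');

console.log(blockOnlyResult);   // ¹²³ survives, ⁴⁵⁶ is removed
console.log(withLatin1Result);  // both runs are removed
console.log(noPlusResult);      // same string as above, just one match per char
```

With or without `+` the resulting string is the same; the quantifier only changes how many matches the engine needs to produce it.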

Note that ᵀᴹ are modifier letters and are not formally superscript chars. If you want to include them, you need

var res = s.replace(/(?:\uD81A[\uDF40-\uDF43]|\uD81B[\uDF93-\uDF9F\uDFE0]|[\u00B0\u00B2\u00B3\u00B9\u02AF\u0670\u0711\u2121\u213B\u2207\u29B5\uFC5B-\uFC5D\uFC63\uFC90\uFCD9\u2070\u2071\u2074-\u208E\u2090-\u209C\u0345\u0656\u17D2\u1D62-\u1D6A\u2A27\u2C7C\u02B0-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0374\u037A\u0559\u0640\u06E5\u06E6\u07F4\u07F5\u07FA\u081A\u0824\u0828\u0971\u0E46\u0EC6\u10FC\u17D7\u1843\u1AA7\u1C78-\u1C7D\u1D2C-\u1D6A\u1D78\u1D9B-\u1DBF\u2071\u207F\u2090-\u209C\u2C7C\u2C7D\u2D6F\u2E2F\u3005\u3031-\u3035\u303B\u309D\u309E\u30FC-\u30FE\uA015\uA4F8-\uA4FD\uA60C\uA67F\uA69C\uA69D\uA717-\uA71F\uA770\uA788\uA7F8\uA7F9\uA9CF\uA9E6\uAA70\uAADD\uAAF3\uAAF4\uAB5C-\uAB5F\uFF70\uFF9E\uFF9F])+/g, '');

See this demo
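In engines that support ES2018 Unicode property escapes, the modifier-letter part of such a pattern can probably be written more compactly as `\p{Lm}` (general category "Letter, modifier") with the `u` flag. This is only a sketch of that alternative, not the pattern above: the function names are hypothetical, and `\p{Lm}` follows whatever Unicode version the engine ships, so its coverage may differ slightly from a hand-written list.

```javascript
// Hypothetical helper: strips modifier letters such as ᵀ (U+1D40) and ᴹ (U+1D39).
// Requires ES2018 Unicode property escapes; the u flag enables \p{...}.
function stripModifierLetters(s) {
    return s.replace(/\p{Lm}+/gu, '');
}

// Hypothetical helper: modifier letters plus a reduced class of
// superscript/subscript characters (an assumption, not the full list).
function stripSuperSubAndModifiers(s) {
    return s.replace(/[\p{Lm}\u00B2\u00B3\u00B9\u2070\u2071\u2074-\u208E\u2090-\u209C]+/gu, '');
}

console.log(stripModifierLetters("Acmeᵀᴹ widget"));
console.log(stripSuperSubAndModifiers("aᵀᴹb¹₂c"));
```

Note that `\p{Lm}` on its own does not match superscript digits such as ¹²³, which is why the second helper keeps an explicit class next to it.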

To normalize subscript and superscript digits, it makes sense to use a dictionary and replace dynamically within an anonymous method passed as the replacement argument:

var super_sub_script_dict = {'\u2070': '0', '\u00B9': '1', '\u00B2': '2', '\u00B3': '3', '\u2074': '4', '\u2075': '5', '\u2076': '6', '\u2077': '7', '\u2078': '8', '\u2079': '9', '\u2080': '0', '\u2081': '1', '\u2082': '2', '\u2083': '3', '\u2084': '4', '\u2085': '5', '\u2086': '6', '\u2087': '7', '\u2088': '8', '\u2089': '9'};
var test_string = "Subscript: ₀₁₂₃₄₅₆₇₈₉ and superscript: ⁰¹²³⁴⁵⁶⁷⁸⁹";
var regex = new RegExp('[' + Object.keys(super_sub_script_dict).join("") + ']', 'g'); // => /[⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉]/g
// Or
// var regex = /[\u00B9\u00B2\u00B3\u2070\u2074-\u2089]/g;
console.log(test_string.replace(regex, function(x) { 
    return super_sub_script_dict[x];
}))
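As an alternative to maintaining the dictionary by hand, the built-in `String.prototype.normalize('NFKC')` maps superscript and subscript characters to their compatibility equivalents. The sketch below applies it only to characters matched by a reduced class, so the rest of the string is not normalized; the names `SUPER_SUB` and `normalizeSuperSub` and the choice of class are assumptions made for this illustration.

```javascript
// Reduced class of superscript/subscript characters (an assumption, not exhaustive).
var SUPER_SUB = /[\u00B2\u00B3\u00B9\u2070\u2071\u2074-\u208E\u2090-\u209C]/g;

// Hypothetical helper: replace each matched character with its NFKC form,
// e.g. ₃ (U+2083) becomes 3 and ² (U+00B2) becomes 2.
function normalizeSuperSub(s) {
    return s.replace(SUPER_SUB, function (c) {
        return c.normalize('NFKC');
    });
}

console.log(normalizeSuperSub("mg CaCO₃/L"));
console.log(normalizeSuperSub("E = mc²"));
```

Unlike the digit-only dictionary, this also converts the letters and operators in the class; the exact replacement for each character is whatever its Unicode compatibility decomposition specifies, so it is worth checking the symbols you care about.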
Wiktor Stribiżew
  • This does not seem to work. I tried both `"ᵀᴹ".replace(/[\u2070\u2071\u2074-\u208E\u2090-\u209C]+/gi, '')` and `"¹²³".replace(/[\u2070\u2071\u2074-\u208E\u2090-\u209C]+/g, '')` and nothing was replaced. By the way, thank you for your assistance – Herbi Shtini Jul 18 '17 at 12:09
  • Those `ᵀᴹ` are modifier letters. Do you also want to match modifier letters like that? – Wiktor Stribiżew Jul 18 '17 at 12:17
  • Everything in superscript or subscript is causing an issue in my app, so I want to remove all of it. Note that even "¹²³" was not replaced by the regex you posted. I tested it in the Chrome console – Herbi Shtini Jul 18 '17 at 12:21
  • I updated the answer with 2 variations: one that includes chars whose Unicode names contain "subscript"/"superscript" words, and the second solution that also matches modifier letters in addition to the first pattern. – Wiktor Stribiżew Jul 18 '17 at 12:33
  • @WiktorStribiżew Can we replace them with the equivalent normal characters instead? For example, mg CaCO₃/L becomes mg CaCO3/L. A solution that only handles digits would also work – MSD Nov 22 '19 at 07:00
  • @MSD I added a code snippet showing how to normalize the subscript and superscript digits. – Wiktor Stribiżew Nov 22 '19 at 10:00