1

I have some strings that contain subscript and superscript characters.

Is there any way I can remove them while keeping the rest of the string?

Here is an example: ¹ºUnless otherwise indicated. How can I remove the superscript ¹º?

Thanks in advance!

  • When you say "remove the superscript", do you mean removing those characters from the string entirely, or are you hoping to un-superscript the superscript characters somehow? – user2357112 May 15 '20 at 04:58
  • I meant removing those characters from the string entirely, so that the final output is 'Unless otherwise indicated'. – libertefudgy May 15 '20 at 07:58

2 Answers

2

ASCII characters have ordinal values in range(128), that is, 0 through 127: range excludes its upper bound, and it starts at 0 when no lower bound is given. Subscript and superscript characters are not part of ASCII, so you can strip them by dropping every character whose ordinal falls outside this range:

>>> x = '¹ºUnless otherwise indicated'
>>> y = ''.join([i for i in x if ord(i) < 128])
>>> y
'Unless otherwise indicated'

This iterates over the characters of x, keeps only those in the ASCII range, and joins the resulting list of characters back into a str.
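One consequence of the ASCII filter is worth seeing concretely: it removes every non-ASCII character, not just superscripts. A minimal sketch, where the function name `strip_non_ascii` and the second sample string are made up for illustration:

```python
def strip_non_ascii(text: str) -> str:
    """Return text with every character whose code point is 128 or higher removed."""
    # A list comprehension is used because str.join materializes its input anyway.
    return "".join([ch for ch in text if ord(ch) < 128])


print(strip_non_ascii("¹ºUnless otherwise indicated"))
# Caveat: accented letters are not ASCII either, so the "é" is dropped along with the "²".
print(strip_non_ascii("café ²"))
```

If the text may contain accented letters or other non-English characters that should survive, this filter is probably too aggressive.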

awarrier99
  • `str.join` actually builds a list out of the input anyway (to figure out the highest code point in the input and preallocate the result), so using a generator doesn't save time or memory. – user2357112 May 15 '20 at 05:00
  • @user2357112supportsMonica Never knew it. That's the reason I suggested the change. Surely will look into it. Thanks for pointing it out. – Ch3steR May 15 '20 at 05:02
  • In fact using the generator is [slower](https://stackoverflow.com/a/9061024/3620003) with `str.join`. – timgeb May 15 '20 at 05:02
  • Interesting, I guess it's better to recommend the other version then. Thanks for the info – awarrier99 May 15 '20 at 05:03
2

The only sure way to do this is to enumerate all of the superscript and subscript symbols that might occur and remove every character that is in that set.
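The enumeration approach could be sketched with `str.translate`. The character lists below are an assumption and are not exhaustive, and the names `SUPERSCRIPTS`, `SUBSCRIPTS`, `ORDINAL_INDICATORS` and `strip_scripts` are made up for this example; extend the lists with whatever symbols actually occur in your data.

```python
# Hypothetical, non-exhaustive lists of characters to remove.
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ"
SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎"
# U+00AA and U+00BA are the feminine and masculine ordinal indicators;
# they are often typed in place of a superscript "a", "o" or "0".
ORDINAL_INDICATORS = "\u00aa\u00ba"

# With three arguments, str.maketrans maps every character of the third
# argument to None, which makes str.translate delete it.
_REMOVE = str.maketrans("", "", SUPERSCRIPTS + SUBSCRIPTS + ORDINAL_INDICATORS)


def strip_scripts(text: str) -> str:
    """Delete the enumerated characters and leave everything else untouched."""
    return text.translate(_REMOVE)


print(strip_scripts("¹ºUnless otherwise indicated"))
print(strip_scripts("H₂O and café"))
```

Unlike a blanket filter, this leaves accented letters and other scripts alone, at the cost of maintaining the list by hand.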

If your strings are not too unusual, you can instead filter on the Unicode categories "letter, other" (Lo) and "number, other" (No). Be aware that these categories cover other characters in addition to superscripts and subscripts. For example:

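import unicodedata

s = "¹ºUnless otherwise indicated"
# Drop characters in the "No" (number, other) and "Lo" (letter, other) categories.
cleaned = "".join(c for c in s if unicodedata.category(c) not in ("No", "Lo"))
print(cleaned)  # Unless otherwise indicated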
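Before relying on a category filter, it may help to check which category each character of interest actually has. The sketch below uses only `unicodedata.category` and `unicodedata.name` from the standard library; the helper name `categories` and the sample characters are chosen for illustration.

```python
import unicodedata


def categories(text: str) -> dict:
    """Map each distinct character of text to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in text}


# Sample characters: superscript one, masculine ordinal indicator, subscript two,
# superscript n, the fraction one half, and a CJK ideograph.
for ch, cat in categories("¹º₂ⁿ½中").items():
    print(f"U+{ord(ch):04X}  {cat}  {unicodedata.name(ch)}")
```

In this sample, the superscript ⁿ is a modifier letter (Lm), so a filter on No and Lo keeps it, while the fraction ½ (No) and the CJK ideograph (Lo) are removed even though neither is a superscript or subscript.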
adrtam