-2

Modifier letters are characters like 'ᴶ' (U+1D36). I am curious what the most efficient way is to remove them from a list of strings.

I know I can build a list containing all of these code points and run a for loop that checks each of them against the string. I wonder how I can remove them using the "re" package, perhaps by specifying their range.

My string looks like this:

mystr = 'سلام خوبی dsdsd ᴶᴼᴵᴺ'

This is the Unicode code point for 'ᴶ':

https://www.compart.com/en/unicode/U+1D36
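A quick way to check which general category these characters belong to is the standard-library `unicodedata` module. This is only a minimal sketch; the variable name `tail` is made up for illustration.

```python
import unicodedata

mystr = 'سلام خوبی dsdsd ᴶᴼᴵᴺ'

# Inspect the last four characters: code point, general category, official name
tail = mystr[-4:]
for ch in tail:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

Each of the four characters reports the category `Lm` (Letter, modifier), which is the category a property-based solution has to target.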

Wiktor Stribiżew
Areza
    see https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties. Basically, you need one of the solutions there to eliminate `\p{Lm}`. – gog Jun 23 '22 at 08:47
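The range idea from the question can be sketched with the standard `re` module alone, assuming it is acceptable to derive the ranges from `unicodedata` at startup instead of typing them by hand. The names `build_category_pattern` and `LM_PATTERN` are hypothetical, and the ranges depend on the Unicode version bundled with the running Python.

```python
import re
import sys
import unicodedata


def build_category_pattern(category="Lm"):
    """Compile a stdlib `re` pattern matching runs of code points in `category`."""
    ranges = []  # inclusive [start, end] code point pairs
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    # \UXXXXXXXX escapes keep the pattern ASCII-only and unambiguous
    parts = [
        f"\\U{start:08x}" if start == end else f"\\U{start:08x}-\\U{end:08x}"
        for start, end in ranges
    ]
    return re.compile("[" + "".join(parts) + "]+")


# Built once (scans every code point), then reused for all strings
LM_PATTERN = build_category_pattern("Lm")

mystr = 'سلام خوبی dsdsd ᴶᴼᴵᴺ'
cleaned = LM_PATTERN.sub("", mystr)
print(repr(cleaned))
```

Note that `Lm` covers more than the superscript-style letters (Arabic tatweel, U+0640, is also in it), so a narrower hand-picked range may be preferable if only some of these characters should go.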

2 Answers

0

You can find the Unicode general categories here:

https://unicodebook.readthedocs.io/unicode.html

You can try this code (python3):

import unicodedata

inputData = u"سلام خوبی dsdsd ᴶᴼᴵᴺ"
# Drop characters whose general category is 'Lm' (Letter, modifier), e.g. 'ᴶ' (U+1D36)
print(u"".join(x for x in inputData if unicodedata.category(x) != 'Lm'))
Saxon
    It turned out that the regex solution, d = re.sub(r"\p{Lm}", "", text), is faster when the text is a long sentence, but for shorter text your method is faster. Since this is the only answer available for now, I accept yours. However, I will go ahead with regex. – Areza Jun 23 '22 at 09:34
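Since the question is about a list of strings, another option worth measuring is `str.translate` with a deletion table. This is a sketch under the assumption that building the table once up front is acceptable; `LM_TABLE`, `strip_modifier_letters` and `samples` are illustrative names, and no speed claim is made here.

```python
import sys
import unicodedata

# Map every code point in category Lm to None so that str.translate deletes it.
# Built once; the scan covers the whole Unicode range.
LM_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lm"
}


def strip_modifier_letters(strings):
    """Return a new list with all Lm characters removed from each string."""
    return [s.translate(LM_TABLE) for s in strings]


samples = ["سلام خوبی dsdsd ᴶᴼᴵᴺ", "plain ascii", "ᴶᴼᴵᴺ us"]
result = strip_modifier_letters(samples)
print(result)
```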
0

It turned out that the regex approach is faster for longer sentences.

import time
import unicodedata

inputData = u"سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"

a = time.time()
for i in range(1_000_000):
    d = u"".join(x for x in inputData if unicodedata.category(x) != 'Lm')

print(time.time() - a)

This took 17.69 seconds on my 2.4 GHz 8-core Intel Core i9.

import time
import regex as re  # third-party package (pip install regex); the stdlib re module does not support \p{...}

text = u"سلام خوبی dxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsddxxxxxxxxxxsdsd ᴶᴼᴵᴺ"

a = time.time()
for i in range(1_000_000):
    d = re.sub(r"\p{Lm}", "", text)

print(time.time() - a)

This took 6.1 seconds.

If you use the shorter string

u"سلام خوبی dxxxxxxxxxxsdᴶᴼᴵᴺ"

the regex approach takes 6.08 seconds, while the character-level approach takes 5.08 seconds.
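A comparison like the one above can also be sketched with `timeit`, which takes the best of several runs. The helper names (`strip_by_category`, `bench`, `candidates`) and the test strings are assumptions for illustration. The `regex` package is treated as optional and its pattern is precompiled here, which differs slightly from calling `re.sub` with a string on every iteration. Timings will vary by machine.

```python
import timeit
import unicodedata

try:
    import regex  # third-party package; the benchmark skips it if missing
except ImportError:
    regex = None


def strip_by_category(text):
    """Character-level filter using the general category."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Lm")


LM_RE = regex.compile(r"\p{Lm}") if regex is not None else None


def bench(func, text, number=10_000):
    """Return the best of 3 runs, in seconds, for `number` calls of func(text)."""
    return min(timeit.repeat(lambda: func(text), number=number, repeat=3))


short_text = "dxxxxxxxxxxsd ᴶᴼᴵᴺ"
long_text = "dxxxxxxxxxxsdsd" * 6 + " ᴶᴼᴵᴺ"

candidates = {"category filter": strip_by_category}
if LM_RE is not None:
    candidates["regex Lm property"] = lambda t: LM_RE.sub("", t)

for label, text in (("short", short_text), ("long", long_text)):
    for name, func in candidates.items():
        print(f"{label:5s} {name:18s} {bench(func, text):.4f} s")
```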

Areza