I am trying to replace all special characters using a regex, and I am comparing the results between JavaScript (Node.js v10.16.3) and Python (3.7.x). This is the input text:

\t kickref, first really multi-level referral program on the сrypto market, has reached over 20 000 users just in 2 days after its start on september 28.

Splitting the sentence into characters, just to see the character codes, gives me this character array:

'["\\t"," ","k","i","c","k","r","e","f",","," ","f","i","r","s","t"," ","r","e","a","l","l","y"," ","m","u","l","t","i","-","l","e","v","e","l"," ","r","e","f","e","r","r","a","l"," ","p","r","o","g","r","a","m"," ","o","n"," ","t","h","e"," ","с","r","y","p","t","o"," ","m","a","r","k","e","t",","," ","h","a","s"," ","r","e","a","c","h","e","d"," ","o","v","e","r"," ","2","0"," ","0","0","0"," ","u","s","e","r","s"," ","j","u","s","t"," ","i","n"," ","2"," ","d","a","y","s"," ","a","f","t","e","r"," ","i","t","s"," ","s","t","a","r","t"," ","o","n"," ","s","e","p","t","e","m","b","e","r"," ","2","8","."]'

These are the character codes for each character:

'[9,32,107,105,99,107,114,101,102,44,32,102,105,114,115,116,32,114,101,97,108,108,121,32,109,117,108,116,105,45,108,101,118,101,108,32,114,101,102,101,114,114,97,108,32,112,114,111,103,114,97,109,32,111,110,32,116,104,101,32,1089,114,121,112,116,111,32,109,97,114,107,101,116,44,32,104,97,115,32,114,101,97,99,104,101,100,32,111,118,101,114,32,50,48,32,48,48,48,32,117,115,101,114,115,32,106,117,115,116,32,105,110,32,50,32,100,97,121,115,32,97,102,116,101,114,32,105,116,115,32,115,116,97,114,116,32,111,110,32,115,101,112,116,101,109,98,101,114,32,50,56,46]'

The problem is caused by the letter 'c' in the word crypto. Notice that its character code in the array is 1089, whereas the 'c' in kickref has code 99.
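One way to spot such a character is to list everything whose code point is above 127. The following is a minimal sketch, not part of the original code: the shortened `text` value and the helper name `findNonAscii` are assumptions for illustration, and `\u0441` is simply code 1089 from the array above written as an escape.

```javascript
// Shortened sample; '\u0441' is the character with code 1089 (0x441) from the array.
const text = 'on the \u0441rypto market';

// Hypothetical helper: collect every non-ASCII character with its position and code point.
function findNonAscii(str) {
  const hits = [];
  // Array.from iterates by code point, so characters outside the BMP are not split.
  Array.from(str).forEach((ch, index) => {
    const codePoint = ch.codePointAt(0);
    if (codePoint > 127) {
      const hex = 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
      hits.push({ index, ch, codePoint, hex });
    }
  });
  return hits;
}

const nonAscii = findNonAscii(text);
console.log(nonAscii);                // one entry: index 7, codePoint 1089, hex 'U+0441'
console.log('c'.codePointAt(0));      // 99, the Latin letter c
```

Looking up the reported `U+....` value in a Unicode chart shows which script the character belongs to.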

In JS, my code to do the replacement looks as follows:

const regexSpecialCharacters = new RegExp(/\W/, 'g');
text.replace(regexSpecialCharacters, ' ');

This yields the following sentence:

kickref  first really multi level referral program on the  rypto market  has reached over 20 000 users just in 2 days after its start on september 28 

The letter c got removed.

In Python, my regex to do the exact same thing looks like this:

import re
regex_special_characters = re.compile(r'\W')
regex_special_characters.sub(' ', text)

This gives me the following output:

kickref  first really multi level referral program on the сrypto market  has reached over 20 000 users just in 2 days after its start on september 28 

The letter c has NOT been removed in Python. Can anyone kindly tell me why? I don't want JS to remove the letter c either. What do I do?

PirateApp

1 Answer

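с is a Cyrillic letter (U+0441, CYRILLIC SMALL LETTER ES), not the Latin c. In Python 3, \W is Unicode-aware by default for str patterns, so с counts as a word character and is kept. In JavaScript, \W is ASCII-only and behaves like [^A-Za-z0-9_], so it matches с and the letter is replaced.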

To also remove it in Python, pass re.ASCII as the flag:

import re
regex_special_characters = re.compile(r'\W', re.ASCII)
regex_special_characters.sub(' ', text)

More details from the re.ASCII documentation:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
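The JavaScript side of this difference can be checked directly. The sketch below is illustrative only (the variable names are assumptions); it tests the Cyrillic letter against `\W` with and without the `u` flag and against the explicit ASCII class.

```javascript
// '\u0441' is CYRILLIC SMALL LETTER ES, the first letter of "сrypto" in the question.
const cyrillicEs = '\u0441';

// \W treats it as a non-word character, with or without the u flag.
const matchesPlain = /\W/.test(cyrillicEs);
const matchesUnicodeFlag = /\W/u.test(cyrillicEs);

// The explicit ASCII class gives the same answer.
const matchesAsciiClass = /[^A-Za-z0-9_]/.test(cyrillicEs);

// A Latin "c" is a word character, so \W does not match it.
const matchesLatinC = /\W/.test('c');

console.log({ matchesPlain, matchesUnicodeFlag, matchesAsciiClass, matchesLatinC });
// { matchesPlain: true, matchesUnicodeFlag: true, matchesAsciiClass: true, matchesLatinC: false }
```

In other words, JavaScript's default behaves like Python with re.ASCII, and the `u` flag alone does not change what `\W` means for this character.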

Wiktor Stribiżew
  • Thanks! Do you have any idea how to keep the character from being dropped in JS? Even with the u flag, JS still drops the letter c. – PirateApp Oct 11 '19 at 10:02
  • If you take the string s = 'hernández' and run the same regex in JS and Python to replace \W, the Python version, thanks to its Unicode support, doesn't drop the á, but the JS version does. – PirateApp Oct 11 '19 at 10:04
  • @PirateApp You mean to make `\W` Unicode aware in JS RegExp? In Chrome, `replace(/[^\d\p{L}]/gu, ' ')` will work, but it will not work in FF and IE. For older browsers/JS environments, use approaches listed in [Javascript + Unicode regexes](https://stackoverflow.com/questions/280712/javascript-unicode-regexes) – Wiktor Stribiżew Oct 11 '19 at 10:06
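A minimal sketch of the property-escape approach from the comment above, assuming a JS engine that supports Unicode property escapes (`\p{...}` together with the `u` flag). The helper name `replaceSpecialCharacters` and the shortened sample are assumptions for illustration. The class used here, `[^\p{L}\p{N}_]`, is a variation on the comment's `[^\d\p{L}]` that also keeps the underscore and non-ASCII digits, which is closer to what Python's `\W` leaves in place.

```javascript
// Match anything that is not a letter, a number, or an underscore in any script.
// \p{L} = any Unicode letter, \p{N} = any Unicode number; both require the u flag.
const unicodeNonWord = /[^\p{L}\p{N}_]/gu;

// Hypothetical helper: replace every "special" character with a space.
function replaceSpecialCharacters(text) {
  return text.replace(unicodeNonWord, ' ');
}

// Shortened sample from the question; '\u0441' is the Cyrillic letter in "сrypto".
const sample = '\tkickref, on the \u0441rypto market.';
const cleaned = replaceSpecialCharacters(sample);

console.log(JSON.stringify(cleaned));
// The tab, comma and full stop become spaces; the Cyrillic letter is kept.
console.log(replaceSpecialCharacters('hern\u00e1ndez')); // unchanged: the á is a letter
```

Accented letters written as a base letter plus a combining mark would still lose the mark, since combining marks are not in `\p{L}`; normalizing the input first with `text.normalize('NFC')` is one option if that matters.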