I'm currently writing a program in Python where I have to handle smileys such as :), :(, :-), :-(. If a smiley is followed by special characters or punctuation, those trailing characters should be removed. For example, "Hi, this is good :)#" should become "Hi, this is good :)".

I have created a regex pattern for re.sub, but I couldn't include the smiley :-) in it: the - is treated as a character range.

re.sub(r"[^a-zA-Z0-9:):D)]+", " " , words) works fine, but I need to add the :-) smiley to the regex.

vahdet
noobster
  • Please [check my answer](https://stackoverflow.com/a/54997594/3832970), with demo and explanations. Note that the main problem with your pattern is that it contains a single character class where you added *a sequence of patterns* to be matched, but it does not work like that. You need a grouping here. – Wiktor Stribiżew Mar 05 '19 at 13:14
  • This is not a shameless promotion of my answer, but you could also check my answer to see if it worked for you. – Tim Biegeleisen Mar 06 '19 at 14:32
  • @WiktorStribiżew It works perfectly! But the same regex pattern throws an error in Python 2. – noobster Mar 06 '19 at 14:35

3 Answers

One approach is to use the following pattern:

(:\)|:\(|:-\)|:-\()[^A-Za-z0-9]+

This matches and captures a smiley face, then matches one or more non-alphanumeric characters immediately afterwards. The replacement is just the captured smiley face, which removes the trailing non-alphanumeric characters.

import re

text = "Hi, this is good :)#"
# r"\1" must be a raw string; a plain "\1" is the control character \x01
output = re.sub(r"(:\)|:\(|:-\)|:-\()[^A-Za-z0-9]+", r"\1", text)
print(output)

Hi, this is good :)
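
One thing worth checking with this pattern is that [^A-Za-z0-9] also matches whitespace, so a space after the smiley is removed along with the punctuation. The sketch below compares the pattern above with a hypothetical variant (not part of the answer) that adds \s to the negated class so that spaces survive. The names STRICT and KEEP_SPACES are made up for illustration.

```python
import re

# Pattern from the answer: a smiley followed by one or more non-alphanumerics.
STRICT = re.compile(r"(:\)|:\(|:-\)|:-\()[^A-Za-z0-9]+")
# Hypothetical variant (an assumption, not from the answer): leave whitespace alone.
KEEP_SPACES = re.compile(r"(:\)|:\(|:-\)|:-\()[^A-Za-z0-9\s]+")

text = "sad :-(!!! but ok :)#"

strict_result = STRICT.sub(r"\1", text)
spaced_result = KEEP_SPACES.sub(r"\1", text)

print(strict_result)  # sad :-(but ok :)
print(spaced_result)  # sad :-( but ok :)
```

Which behavior is right depends on whether the text following a smiley should stay separated from it.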
Tim Biegeleisen

The [^a-zA-Z0-9:):D)] pattern is erroneous because it is a character class, and a character class matches a single char from a set, not sequences of chars. To match char sequences such as smileys, you need to add an alternative (a group) to this regex.
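
As a small sketch of that point: the class in the question is just a set of single characters, so repeating : or ) inside it changes nothing, and a lone ), : or D is kept just as readily as a full smiley. The sample string and variable names here are made up for illustration.

```python
import re

# The class from the question.
original = r"[^a-zA-Z0-9:):D)]+"
# The same set of characters with the repeats removed.
equivalent = r"[^a-zA-Z0-9:)D]+"

sample = "a) b: D# :)!"

original_result = re.sub(original, " ", sample)
equivalent_result = re.sub(equivalent, " ", sample)

# Both patterns behave identically: lone ")" and ":" survive,
# because the class never sees ":)" as a unit.
print(repr(original_result))    # 'a) b: D :) '
print(repr(equivalent_result))  # 'a) b: D :) '
```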

To remove any punctuation other than a certain list of smileys you may use

re.sub(r"(:-?[()D])|[^A-Za-z0-9\s]", r"\1" , s)

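Or, in Python 3.4 and older, where re.sub raises an "unmatched group" error when the replacement references a group that did not take part in the match, use a callable replacement:

re.sub(r"(:-?[()D])|[^A-Za-z0-9\s]", lambda x: x.group(1) if x.group(1) else "", s)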

If you really need to avoid removing commas, add , into the negated character class:

re.sub(r"(:-?[()D])|[^A-Za-z0-9,\s]", r"\1" , s)
                               ^

See the regex demo.

Details

  • (:-?[()D]) - matches and captures into Group 1 a :, then an optional -, and then a single char from the character class: (, ) or D (this captures the smileys like :-), :-(, :), :(, :-D, :D)
  • [^A-Za-z0-9,\s] - matches any char but an ASCII letter, digit, comma and whitespace. To make it fully Unicode aware, replace with (?:[^\w\s,]|_).
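
The two points above can be checked with a short sketch. It first confirms that :-?[()D] matches each of the six listed smileys, then compares the ASCII class with the Unicode-aware replacement on a made-up string containing accented letters. The helper clean and the sample text are hypothetical. A callable replacement is used so the sketch does not depend on how the Python version handles unmatched groups.

```python
import re

# The smiley sub-pattern on its own.
SMILEY = re.compile(r":-?[()D]")
smileys = [":)", ":(", ":-)", ":-(", ":D", ":-D"]
all_match = all(SMILEY.fullmatch(s) for s in smileys)

ASCII_ONLY = r"(:-?[()D])|[^A-Za-z0-9,\s]"
UNICODE_AWARE = r"(:-?[()D])|(?:[^\w\s,]|_)"

def clean(pattern, text):
    # Keep Group 1 (the smiley) when it matched; otherwise drop the char.
    return re.sub(pattern, lambda m: m.group(1) or "", text)

text = "Héllo, wörld :-D!!"
ascii_result = clean(ASCII_ONLY, text)
unicode_result = clean(UNICODE_AWARE, text)

print(all_match)       # True
print(ascii_result)    # Hllo, wrld :-D
print(unicode_result)  # Héllo, wörld :-D
```

The ASCII class treats é and ö as punctuation and removes them, while the \w-based version keeps them (for str patterns in Python 3, \w is Unicode-aware by default).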

See the Python 3.5+ demo:

import re
s = "Hi, this is good :)#"
print( re.sub(r"(:-?[()D])|[^A-Za-z0-9,\s]", r"\1" , s) )
# => Hi, this is good :)

See this demo for Python 3.4 and older:

import re
s = "Hi, this is good :)#"
print( re.sub(r"(:-?[()D])|[^A-Za-z0-9,\s]", lambda x: x.group(1) if x.group(1) else "", s) )
# => Hi, this is good :)
Wiktor Stribiżew
  • This is the perfect regex pattern: it recognizes only the smileys rather than a lone ")" or ":". Python 2 throws an error with this pattern. Will it work only on Python 3? – noobster Mar 06 '19 at 14:32
  • @noobster Yes, it won't work with any Python before Python 3.5 where the issue was fixed. Use https://rextester.com/VKR32235 with earlier versions. – Wiktor Stribiżew Mar 06 '19 at 14:42
  • @noobster Glad it works, please consider accepting the answer then. – Wiktor Stribiżew Mar 07 '19 at 00:43

You can escape special characters with \. Try:

re.sub(r"[^a-zA-Z0-9:):D:\-))]+", " " , words)
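
A minimal sketch of what the escape changes, using a made-up input string: with an unescaped hyphen, :-) inside the brackets is read as a range from : to ), which is invalid because : sorts after ), so compilation fails. With \- the hyphen is a literal member of the class. The class still works on single characters, so a lone -, : or ) is kept as well.

```python
import re

# Unescaped hyphen: ":-)" is parsed as a (reversed, hence invalid) range.
try:
    re.compile(r"[^a-zA-Z0-9:):D:-)]+")
    unescaped_error = None
except re.error as exc:
    unescaped_error = str(exc)

# Escaped hyphen: "-" is just another character in the set.
escaped = re.compile(r"[^a-zA-Z0-9:):D:\-))]+")
result = escaped.sub(" ", "good :-)#")

print(unescaped_error is not None)  # True
print(repr(result))                 # 'good :-) '
```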
vencaslac