For example:

import re

s1 = 'LOGO 设计'
## s2 = '设计 LOGO'

s = re.sub('[a-zA-Z0-9]{3,}(\s)[^a-zA-Z0-9]', '', s1)

print(s)

I want to find at least 3 ASCII characters, followed by a space, followed by a non-ASCII character, and replace the whitespace with an empty string. My code has two issues:

  1. How do I write the replacement string for (\s)?

  2. How do I make it also work for the reverse order, as in s2, where the following comes first:

    [^a-zA-Z0-9]

marlon
  • `[^a-zA-Z0-9]` doesn't mean non-ASCII. Punctuation characters like `!&^` are ASCII, but they'll be matched by that. – Barmar Sep 16 '22 at 22:31
  • How do I represent CJK characters in the second part? – marlon Sep 16 '22 at 23:00
  • See https://stackoverflow.com/questions/150033/regular-expression-to-match-non-ascii-characters – Barmar Sep 16 '22 at 23:02
  • I replaced it with [^\x00-\x7F]+ and it seems to work. If I just want to check one character, would [^\x00-\x7F] be more efficient? Overall, are these kinds of regex operations efficient enough? I am working on social media texts; most of them are short, but some may be as long as a typical news article. – marlon Sep 16 '22 at 23:09
  • As long as you don't have quantifiers that allow for long matches and may require backtracking, you should be fine. – Barmar Sep 16 '22 at 23:10
  • Isn't {3} a quantifier? – marlon Sep 16 '22 at 23:11
  • Yes, but it doesn't allow for long matches like `*` and `+` do. – Barmar Sep 16 '22 at 23:12
  • So is it better to use [^\x00-\x7F] instead of [^\x00-\x7F]+? The latter has a '+'. – marlon Sep 16 '22 at 23:13
  • Yes. You don't need `+` since matching 1 character after the space is enough. – Barmar Sep 16 '22 at 23:14
  • [`s=re.sub(r'(?iu)([a-z\d]{3}(?=\s[^\W\da-z])|[^\W\da-z](?=\s[a-z\d]{3}))\s',r'\1',s1)`](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEolYur2FDBVkHdx9/dX@HFun0v1i1U51KGiEG4CiApdaA6oFBRql5xaZJGkbqGfWappkZ0om5VTEpstXGthr1tTHF0XEx4TApQLFazBsGGSMFVamrGFKvrKBSpxxgCqWJDTS6ugqLMvBKNYs3//wE) (leaving my idea as a comment, guess it did not work for you) – bobble bubble Sep 17 '22 at 00:55
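
The character-class point raised in these comments can be checked directly. The following is a minimal sketch, assuming the `[^\x00-\x7F]` class from the linked question is the intended meaning of "non-ASCII"; the sample characters are made up for illustration and do not come from the question.

```python
import re

# Class used in the question: anything that is not an ASCII letter or digit.
not_alnum = re.compile(r'[^a-zA-Z0-9]')
# Class from the comments: anything outside the 7-bit ASCII range.
non_ascii = re.compile(r'[^\x00-\x7F]')

samples = ['!', '&', ' ', '设', 'é']  # illustrative inputs only

for ch in samples:
    # Columns: character, matched by not_alnum, matched by non_ascii
    print(repr(ch), bool(not_alnum.fullmatch(ch)), bool(non_ascii.fullmatch(ch)))
```

ASCII punctuation and the space are matched by the first class but not the second, which is why the second is the safer choice when only characters outside ASCII should count.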

2 Answers

Put the strings that you want to keep in the result in capture groups, then reference them in the replacement.

s = re.sub(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])', r'\1\2', s1)

You don't need {3,}; {3} is enough. With {3}, the match covers only the last 3 alphanumeric characters before the space, and \1 copies them back into the result. All the preceding characters are kept by default because they are not part of the match.
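
As a runnable check of the capture-group approach, here is a small sketch using the pattern from this answer on the question's `s1`. The callable replacement is an equivalent alternative added for illustration, not something the answer proposes.

```python
import re

s1 = 'LOGO 设计'

# Groups 1 and 2 hold the text to keep; the \s between them is dropped.
pattern = re.compile(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])')

result = pattern.sub(r'\1\2', s1)
print(result)  # LOGO设计

# Equivalent form with a callable replacement instead of backreferences.
result_fn = pattern.sub(lambda m: m.group(1) + m.group(2), s1)
print(result_fn == result)  # True
```

The match itself is only 'OGO 设'; the leading 'L' and trailing '计' are untouched, which is why `{3}` is sufficient.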

You can also do it with lookarounds, by matching a space that's preceded by 3 ASCII characters and followed by a non-ASCII character. The lookarounds only assert what surrounds the space, so the match is the space alone, and you replace it with an empty string.

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])', '', s1)
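
One practical reason to prefer `{3}` over `{3,}` in the lookbehind version: Python's standard `re` module only accepts fixed-width lookbehind patterns. The sketch below assumes the standard library `re` (third-party engines may behave differently) and shows the fixed-width form working and the variable-width form being rejected at compile time.

```python
import re

# Fixed-width lookbehind: accepted by Python's re module.
fixed = re.compile(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])')
print(fixed.sub('', 'LOGO 设计'))  # LOGO设计

# Variable-width lookbehind ({3,}): rejected when the pattern is compiled.
try:
    re.compile(r'(?<=[a-zA-Z0-9]{3,})\s(?=[^a-zA-Z0-9])')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('re.error:', exc)
```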

You can use alternation with this method to match both orders:

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])|(?<=[^a-zA-Z0-9])\s(?=[a-zA-Z0-9]{3})', '', s1)
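
The two-order pattern can be wrapped in a small function and tried on both strings from the question. This is a sketch under stated assumptions: it swaps in the `[^\x00-\x7F]` class suggested in the comments so that ASCII punctuation does not count as non-ASCII, and the names `SPACE_BETWEEN` and `join_mixed_script` are hypothetical, chosen only for this example.

```python
import re

# Non-ASCII class taken from the comment thread; use [^a-zA-Z0-9] instead
# to reproduce the answer's pattern exactly.
NON_ASCII = r'[^\x00-\x7F]'
ALNUM3 = r'[a-zA-Z0-9]{3}'

# Hypothetical names, introduced only for this illustration.
SPACE_BETWEEN = re.compile(
    rf'(?<={ALNUM3})\s(?={NON_ASCII})|(?<={NON_ASCII})\s(?={ALNUM3})'
)

def join_mixed_script(text):
    """Drop one whitespace character that sits between a run of at least
    3 ASCII alphanumerics and a non-ASCII character, in either order."""
    return SPACE_BETWEEN.sub('', text)

print(join_mixed_script('LOGO 设计'))      # LOGO设计
print(join_mixed_script('设计 LOGO'))      # 设计LOGO
print(join_mixed_script('LOGO 设计 SKY'))  # LOGO设计SKY
print(join_mixed_script('LOGO ! 设计'))    # unchanged, because '!' is ASCII
```

Because the lookarounds consume nothing, the character '计' in the third example can serve as context for the space after it even though '设' was context for the space before it; both spaces are removed in one pass.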
Barmar

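With lookahead and lookbehind, split the string and then strip the whitespace from each piece:

import re

s1 = 'LOGO 设计 SKY  आकाश'

# Split at each zero-width boundary between a non-letter and an ASCII letter
# (splitting on an empty match requires Python 3.7+)
st = re.split(r'(?<=[^a-zA-Z])(?=[a-zA-Z])', s1)

# Remove all whitespace inside each piece
print([re.sub(r'\s+', '', e) for e in st])

['LOGO设计', 'SKYआकाश']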
LetzerWille