How to remove white space in between ascii and nonascii chars?

Question

For example:

import re

s1 = 'LOGO 设计'
## s2 = '设计 LOGO'

s = re.sub('[a-zA-Z0-9]{3,}(\s)[^a-zA-Z0-9]', '', s1)

print(s)

I want to find at least 3 ASCII characters followed by a space and then a non-ASCII character, and replace the whitespace with an empty string. My code has two issues:

How to write the replacement string for (\s)?
How to make it also work for the reverse order of s2?:

[^a-zA-Z0-9]

`[^a-zA-Z0-9]` doesn't mean non-ASCII. Punctuation characters like `!&^` are ASCII, but they'll be matched by that. — Barmar, Sep 16 '22 at 22:31
See https://stackoverflow.com/questions/150033/regular-expression-to-match-non-ascii-characters — Barmar, Sep 16 '22 at 23:02
I replaced it with [^\x00-\x7F]+ and it seems to work. If I only need to check one character, would [^\x00-\x7F] be more efficient? Overall, are these kinds of regex operations efficient enough? I am working on social media texts, and although most of them are short, some may be as long as a typical news article. — marlon, Sep 16 '22 at 23:09
As long as you don't have quantifiers that allow for long matches and may require backtracking, you should be fine. — Barmar, Sep 16 '22 at 23:10
Yes, but it doesn't allow for long matches like `*` and `+` do. — Barmar, Sep 16 '22 at 23:12
So is it better to use [^\x00-\x7F] instead of [^\x00-\x7F]+? The latter has a '+'. — marlon, Sep 16 '22 at 23:13
Yes. You don't need `+` since matching 1 character after the space is enough. — Barmar, Sep 16 '22 at 23:14
[`s=re.sub(r'(?iu)([a-z\d]{3}(?=\s[^\W\da-z])|[^\W\da-z](?=\s[a-z\d]{3}))\s',r'\1',s1)`](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEolYur2FDBVkHdx9/dX@HFun0v1i1U51KGiEG4CiApdaA6oFBRql5xaZJGkbqGfWappkZ0om5VTEpstXGthr1tTHF0XEx4TApQLFazBsGGSMFVamrGFKvrKBSpxxgCqWJDTS6ugqLMvBKNYs3//wE) (leaving my idea as a comment, guess it did not work for you) — bobble bubble, Sep 17 '22 at 00:55
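
The character-class point raised in the comments above can be checked directly. This is only a small sketch, and the sample characters are chosen for illustration: it compares the negated alphanumeric class from the question with the `[^\x00-\x7F]` class suggested in the comments.

```python
import re

# Negated alphanumeric class: matches anything that is not an ASCII letter
# or digit, which includes ASCII punctuation and whitespace.
not_alnum = re.compile(r'[^a-zA-Z0-9]')

# Non-ASCII class: matches only code points above U+007F.
non_ascii = re.compile(r'[^\x00-\x7F]')

# Print, for each sample character, whether each class matches it.
for ch in ['!', '&', ' ', '设', 'आ']:
    print(repr(ch), bool(not_alnum.fullmatch(ch)), bool(non_ascii.fullmatch(ch)))
```

Punctuation and the space are matched by the first class but not by the second, while the Chinese and Devanagari characters are matched by both.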

Barmar · Accepted Answer · 2022-09-16T23:20:44.883

1

Put the strings that you want to keep in the result in capture groups, then reference them in the replacement.

s = re.sub(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])', r'\1\2', s1)

You don't need to use {3,}; {3} is enough. This copies the last 3 characters of the ASCII run to the result. All the preceding characters are kept by default because they are not part of the match being replaced.
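
As a quick check of the claim about {3} versus {3,}, here is a minimal sketch using the string from the question. The variable names `with_exact` and `with_open` are made up for this example.

```python
import re

s1 = 'LOGO 设计'

# {3} matches only the last three characters before the space ('OGO');
# the leading 'L' is outside the match and is left in place.
with_exact = re.sub(r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])', r'\1\2', s1)

# {3,} matches the whole run ('LOGO'), which is then copied back via \1.
with_open = re.sub(r'([a-zA-Z0-9]{3,})\s([^a-zA-Z0-9])', r'\1\2', s1)

print(with_exact)  # LOGO设计
print(with_open)   # LOGO设计
```

Both calls give the same result; the only difference is how much of the string is consumed and rewritten by the match.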

You can also do it with lookarounds, by matching a space that's preceded by 3 ASCII characters and followed by a non-ASCII. Then you replace the space with an empty string.

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])', '', s1)

You can use alternation with this method to match both orders:

s = re.sub(r'(?<=[a-zA-Z0-9]{3})\s(?=[^a-zA-Z0-9])|(?<=[^a-zA-Z0-9])\s(?=[a-zA-Z0-9]{3})', '', s1)
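
For reference, here is a sketch of the same lookaround idea wrapped in a function and run on both orders. It makes two assumptions that are not part of the answer's code: it uses the `[^\x00-\x7F]` class from the comments on the question instead of `[^a-zA-Z0-9]`, and the helper name `join_mixed` is hypothetical.

```python
import re

# Assumption: "ASCII chars" means ASCII letters/digits, and "non-ASCII"
# means any code point above U+007F, as suggested in the comments.
PATTERN = re.compile(
    r'(?<=[a-zA-Z0-9]{3})\s(?=[^\x00-\x7F])'   # ASCII run, space, non-ASCII
    r'|(?<=[^\x00-\x7F])\s(?=[a-zA-Z0-9]{3})'  # non-ASCII, space, ASCII run
)

def join_mixed(text):
    """Remove a whitespace character sitting between an ASCII run and a non-ASCII char."""
    return PATTERN.sub('', text)

print(join_mixed('LOGO 设计'))  # LOGO设计
print(join_mixed('设计 LOGO'))  # 设计LOGO
print(join_mixed('Hi! LOGO'))   # Hi! LOGO  ('!' is ASCII, so nothing changes)
print(join_mixed('ab 设计'))    # ab 设计   (fewer than 3 ASCII chars)
```

Each lookbehind has a fixed width, which Python's `re` module requires, so the two alternatives can sit in one pattern.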

edited Sep 16 '22 at 23:20

answered Sep 16 '22 at 22:29

Barmar


How to make it also work for s2 as shown above? – marlon Sep 16 '22 at 23:00
Just swap `[a-zA-Z0-9]` and `[^a-zA-Z0-9]` – Barmar Sep 16 '22 at 23:01
So I need to replace with twice? – marlon Sep 16 '22 at 23:05
I've updated the answer to show how to do it with one call when using the lookarounds. – Barmar Sep 16 '22 at 23:08
So just use an '|' to do the logical or. – marlon Sep 16 '22 at 23:10
Yes, but this can only be used in the second form, since you don't need to deal with capture groups in each alternative. – Barmar Sep 16 '22 at 23:11
's = s = ' should be 's ='? – marlon Sep 16 '22 at 23:12
In the second case, should the {3} be moved to the end? – marlon Sep 16 '22 at 23:16
I wasn't sure. I thought the reverse was 3 non-ASCII followed by space followed by ASCII. – Barmar Sep 16 '22 at 23:17
I meant at least one non-ascii + space + at least 3 ascii chars, the opposite of the first case. – marlon Sep 16 '22 at 23:20
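
A side note on the capture-group form discussed in these comments, offered as a sketch and not something proposed in the thread: in Python 3.5 and later, `re.sub` replaces a group that did not take part in the match with an empty string, so the capture-group version can also be written with alternation by referencing all four groups in the replacement.

```python
import re

pattern = (
    r'([a-zA-Z0-9]{3})\s([^a-zA-Z0-9])'    # groups 1-2: ASCII run, then other char
    r'|([^a-zA-Z0-9])\s([a-zA-Z0-9]{3})'   # groups 3-4: other char, then ASCII run
)

# Only one pair of groups participates in any given match; on Python 3.5+
# the other pair expands to an empty string in the replacement.
results = [re.sub(pattern, r'\1\2\3\4', s) for s in ['LOGO 设计', '设计 LOGO']]
print(results)  # ['LOGO设计', '设计LOGO']
```

The lookaround form is still simpler, since it needs no groups at all.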

score 0 · Answer 2 · answered Sep 17 '22 at 00:40

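With lookahead and lookbehind:

import re

s1 = 'LOGO 设计 SKY  आकाश'

# Split at each position where a non-letter is followed by an ASCII letter
# (splitting on an empty match requires Python 3.7+)
st = re.split(r'(?<=[^a-zA-Z])(?=[a-zA-Z])', s1)

# Strip all whitespace from each piece
print([re.sub(r'\s+', '', e) for e in st])

Output:

['LOGO设计', 'SKYआकाश']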

LetzerWille
