Detect latin characters in regex

Question

I want to apply a regex to Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest adding a # character before the regex.

def clean_str(string):
    string = re.sub(r"#(@[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()

My problem is that the regex works in detecting the Latin characters, but none of the substitutions are applied to the text.

Example: given a text like "@aaa bbb các. ddd",

the result should be "bbb các . ddd", with a space before the dot and the tag "@aaa" deleted.

But it produces the same text as the input: "@aaa bbb các. ddd"

Did I miss something?

In that question you linked to, the answer is not about adding `#` (you have `@` here), but they used `\w` and `re.U` flag to make `\w` Unicode aware. You cannot expect `[A-Za-z]` to match `ł` just because you added some flag. Replace your `#[a-zA-Z_0-9]+` with `@\w+` — Wiktor Stribiżew, May 02 '18 at 19:27
How have you made sure that *the regex work in detecting the latin characters*? — revo, May 02 '18 at 19:31
Adding to @WiktorStribiżew's wonderful reply, `\w` matches `[a-zA-Z0-9_]` (and other variants with the `U` flag). To have `\w` not match `_` you can use `[^\W_]`. Similarly, to only match `[a-zA-Z]` and its Unicode variants (without digits) you can use `[^\W_\d]` — ctwheels, May 02 '18 at 19:32
@ctwheels, both are the same for me, but in my regex I want to process uncharacters like "DOT, comma" etc. — Minions, May 02 '18 at 19:37
By "uncharacters", you must mean "non-word and non-space" chars, right? You may match them with `[^\w\s]` — Wiktor Stribiżew, May 02 '18 at 19:52
@WiktorStribiżew, haha, sorry, too tired. I meant special characters, e.g. (!@#$%>.,) — Minions, May 02 '18 at 19:52
Oh, there is another issue here. You need `flags=re.U` in all your `re.sub`s — Wiktor Stribiżew, May 02 '18 at 20:16
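
The character classes suggested in the comments above can be compared in a short sketch. The sample strings are made up for illustration, and the accented letters are written as escapes only to keep the source unambiguous. In Python 3, `str` patterns are Unicode-aware by default, so `flags=re.U` is spelled out here only to mirror the comments.

```python
import re

accented = "c\u00e1c \u0142"   # "các ł"
mixed = "ab_c1 \u00e92"        # "ab_c1 é2"

# The leading '#' in the question's pattern is a literal character,
# so the pattern cannot match a tag that is not preceded by '#'.
literal_hash = re.search(r"#(@[a-zA-Z_0-9]+)", "@aaa bbb")

# An ASCII range does not cover accented letters; \w does.
ascii_hits = re.findall(r"[a-zA-Z]+", accented, flags=re.U)
word_hits = re.findall(r"\w+", accented, flags=re.U)

# [^\W_] is \w without the underscore; [^\W\d_] also drops digits.
no_underscore = re.findall(r"[^\W_]+", mixed, flags=re.U)
letters_only = re.findall(r"[^\W\d_]+", mixed, flags=re.U)

# [^\w\s] matches "special" characters: neither word chars nor whitespace.
specials = re.findall(r"[^\w\s]", "abc. ddd, e!", flags=re.U)

print(literal_hash)   # None
print(ascii_hits)     # ['c', 'c']
print(word_hits)
print(no_underscore)
print(letters_only)
print(specials)       # ['.', ',', '!']
```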

score 1 · Accepted Answer · edited Nov 20 '22 at 19:04

You have several issues in the current code:

To match any Unicode word char, use `\w` (rather than `[A-Za-z0-9_]`) with a Unicode flag.
When passing `re.U` to `re.sub`, remember that the fourth positional argument is `count`, not `flags`: either supply `count` (0 replaces all occurrences) before the flag, or pass the flag by keyword as `flags=re.U` / `flags=re.UNICODE`.
To match any non-word char except whitespace, you may use `[^\w\s]`.
When you want to replace with the whole match, you do not have to wrap the whole pattern in `(...)`; just use the `\g<0>` backreference in the replacement pattern.
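
The point about `count` and `flags` can be checked with a small sketch. The 40-character sample string is an assumption chosen so the cap is visible: `re.UNICODE` has the integer value 32, so when passed as the fourth positional argument it is read as `count=32`. Newer Python versions may emit a DeprecationWarning for a positional `count`, which the sketch silences.

```python
import re
import warnings

text = "a" * 40

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    # Fourth positional argument is count, so re.UNICODE (value 32)
    # limits the number of replacements instead of acting as a flag.
    positional = re.sub(r"a", "b", text, re.UNICODE)

# Passed by keyword, the flag is a flag and all matches are replaced.
keyword = re.sub(r"a", "b", text, flags=re.UNICODE)

print(positional.count("b"))  # 32
print(keyword.count("b"))     # 40
```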

See an updated method to clean the strings:

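>>> import re
>>> def clean_str(s):
...     # Drop @tags; \w also covers accented letters
...     s = re.sub(r'@\w+', ' ', s, flags=re.U)
...     # Put a space before each special (non-word, non-space) char
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
...     # Collapse runs of whitespace into a single space
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
...     return s.lower().strip()
...
>>> s = "@aaa bbb các. ddd"
>>> print(clean_str(s))
bbb các . ddd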

Then replace `'[^\w\s]+'` with `'[^\w\s]'` to match special characters individually, not as a sequence of consecutive chars. — Wiktor Stribiżew, May 02 '18 at 20:50
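
The difference this comment describes can be seen on a run of consecutive punctuation. This is a sketch with a made-up sample string; `\g<0>` stands for the whole match, so no capturing group is needed.

```python
import re

sample = "wait...what?!"

# With '+', a run of special characters is treated as one match,
# so only one space is inserted before the whole run.
as_runs = re.sub(r"[^\w\s]+", r" \g<0>", sample)

# Without '+', every special character is matched on its own
# and gets its own leading space.
individually = re.sub(r"[^\w\s]", r" \g<0>", sample)

print(as_runs)       # wait ...what ?!
print(individually)  # wait . . .what ? !
```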
