
I want to apply a regex to a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest adding a # character before the regex.

def clean_str(string):
    string = re.sub(r"#(@[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()

My problem is that the regex seems to handle the Latin characters, but none of the substitutions are actually applied to the text.

Example: if I have a text like "@aaa bbb các. ddd", it should become "bbb các . ddd", with a space before the dot and the tag "@aaa" deleted.

But the function returns the input text unchanged: "@aaa bbb các. ddd"

Did I miss something?

Baum mit Augen
Minions
  • In that question you linked to, the answer is not about adding `#` (you have `@` here), but they used `\w` and `re.U` flag to make `\w` Unicode aware. You cannot expect `[A-Za-z]` to match `ł` just because you added some flag. Replace your `#[a-zA-Z_0-9]+` with `@\w+` – Wiktor Stribiżew May 02 '18 at 19:27
  • How have you made sure that *the regex work in detecting the latin characters*? – revo May 02 '18 at 19:31
  • @WiktorStribiżew , it works! .. – Minions May 02 '18 at 19:32
  • Adding to @WiktorStribiżew's wonderful reply, `\w` matches `[a-zA-Z0-9_]` (and other variants with `U` flag). To have `\w` not match `_` you can use `[^\W_]`. Similarly, to only match `[a-zA-Z]` and its Unicode variants (without digits) you can use `[^\W_\d]` – ctwheels May 02 '18 at 19:32
  • @revo , you can test it, it works – Minions May 02 '18 at 19:32
  • @ctwheel , it's the same for me both .. but in my regex I want to process uncharacters line "DOT, comma" etc. – Minions May 02 '18 at 19:37
  • By "uncharacters", you must mean "non-word and non-space" chars, right? You may match them with `[^\w\s]` – Wiktor Stribiżew May 02 '18 at 19:52
  • @WiktorStribiżew, ahhaah srry, tto tired .. Special characters I meant ex. (!@#$%>.,) – Minions May 02 '18 at 19:52
  • Oh, there is another issue here. You need `flags=re.U` in all your `re.sub`s – Wiktor Stribiżew May 02 '18 at 20:16

1 Answer

You have several issues in the current code:

  • To match any Unicode word character, use `\w` (rather than `[A-Za-z0-9_]`) together with the Unicode flag.
  • When passing `re.U` to `re.sub`, remember that the fourth positional argument is `count`, not `flags`. Either pass `count` explicitly before the flag (0 replaces all occurrences), or just use the keyword form `flags=re.U` / `flags=re.UNICODE`.
  • To match any character that is neither a word character nor whitespace, use `[^\w\s]`.
  • To reuse the whole match in the replacement, you do not have to wrap the whole pattern in `(...)`; just use the `\g<0>` backreference in the replacement pattern.
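The positional-flag pitfall from the list above is easy to reproduce. The following is a minimal sketch, not taken from the question's code: the sample string is made up, and `re.IGNORECASE` is used instead of `re.UNICODE` only because its effect is easier to see.

```python
import re

text = "ABC abc"  # made-up sample string

# Flag passed positionally: it fills the `count` slot of re.sub
# (re.IGNORECASE has the integer value 2) and is never applied as a flag.
positional = re.sub(r"abc", "x", text, re.IGNORECASE)

# Flag passed by keyword: it is applied to the pattern.
keyword = re.sub(r"abc", "x", text, flags=re.IGNORECASE)

print(positional)  # ABC x
print(keyword)     # x x
```

In the positional call the match stays case-sensitive, so only the lowercase `abc` is replaced.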

See an updated method to clean the strings:

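>>> import re
>>> def clean_str(s):
...     s = re.sub(r'@\w+', ' ', s, flags=re.U)  # drop @tags
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)  # add a space before punctuation
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)  # collapse runs of whitespace
...     return s.lower().strip()
...
>>> s = "@aaa bbb các. ddd"
>>> print(clean_str(s))
bbb các . ddd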
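The character classes discussed in the comments can be compared side by side. This is only an illustrative sketch: the sample string is made up, and it assumes Python 3, where `str` patterns are Unicode-aware by default.

```python
import re

sample = "các_1 ł"  # made-up sample: accented letters, underscore, digit

# ASCII-only class: the accented letter splits the word, ł is missed entirely.
ascii_only = re.findall(r"[A-Za-z]+", sample)

# \w: Unicode letters, digits and underscore.
word_chars = re.findall(r"\w+", sample)

# [^\W\d_]: word characters minus digits and underscore, i.e. letters only.
letters_only = re.findall(r"[^\W\d_]+", sample)

print(ascii_only)    # ['c', 'c']
print(word_chars)    # ['các_1', 'ł']
print(letters_only)  # ['các', 'ł']
```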
ccpizza
Wiktor Stribiżew