Regular expression to clean strings of words (with accents) and numbers from spaces or other characters at the beginning or end

Question

I'm using Python with Spark to process data that contains accented Portuguese words.

Some examples of the data look like this:

 .. -- Água, 1234 ...

 - -- https://www.example.com/page.html *****

I'm trying to remove anything that is not a word or number from the left or right of the string, getting clean results like this:

   Água, 1234
   https://www.example.com/page.html

The best I could do is this:

 ^[^\\p{N}\\p{L}]]|[^\\p{N}\\p{L}]$

But this didn't work. I've seen a lot of solutions, but none that match at the beginning and end of the string when accented characters are involved.

Thanks in advance.

It takes the accent words away if they are at the beginning or end — Luiz Fernando Lobo, Sep 21 '19 at 23:50
Hey I don't know why you removed your comment but it worked, thanks man :) — Luiz Fernando Lobo, Sep 22 '19 at 00:10
ops sorry, https://stackoverflow.com/questions/18663644/how-to-account-for-accent-characters-for-regex-in-python — αԋɱҽԃ αмєяιcαη, Sep 22 '19 at 00:12

score 1 · Answer 1 · answered Sep 21 '19 at 23:57

Depending on what your data looks like, an expression similar to one of these might work:

(?i)\S[a-z].+[a-z0-9]

or,

(?i)\S*[a-z].+[a-z0-9]

Demo

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. You can also see there how it matches against some sample inputs.

Test

import re


regex = r"(?i)\S[a-z].+[a-z0-9]"
string = """
.. -- Água, 1234 ...

 - -- https://www.example.com/page.html *****
"""

print(re.findall(regex, string))

Output

['Água, 1234', 'https://www.example.com/page.html']
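One caveat worth checking against your own data: the pattern above uses ASCII classes ([a-z], [a-z0-9]), and it matches "Água" only because \S happens to consume the "Á". A word that both starts and ends with accented letters next to the edges, such as "ação", is not matched. Below is a minimal sketch of a Unicode-aware variant of the same "match the core" idea, relying on \w being Unicode-aware for str patterns in Python 3. The names CORE and extract_core are made up for illustration and are not part of the original answer.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, so it matches "ç" and "ã".
# The optional group lets a single-character core (e.g. "a") match as well.
CORE = re.compile(r"\w(?:.*\w)?")


def extract_core(line):
    """Return the span from the first to the last word character, or ''."""
    match = CORE.search(line)
    return match.group(0) if match else ""


print(extract_core("-- ação --"))
print(extract_core(" .. -- Água, 1234 ..."))
```

Note that \w also counts the underscore as a word character, so a leading or trailing "_" would be kept by this sketch.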

Luiz Fernando Lobo · Accepted Answer · 2019-09-22T00:49:37.293

I was able to do it.

Thanks to αԋɱҽԃ αмєяιcαη. It's not the best solution, because it goes outside PySpark's regexp_replace function, but it works: I just added the re.UNICODE flag and created a UDF.


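import re
from pyspark.sql.functions import udf

# Match a run of non-word characters at the start or at the end of the string.
regexp = re.compile(r'^\W+|\W+$', flags=re.UNICODE)

def remove_non_utf8(string):
    return regexp.sub('', string)

replace_utf8 = udf(remove_non_utf8)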

This removes all non-word characters (anything that is not a Unicode letter, digit, or underscore) from the beginning and end of the string. I used this URL as a reference.
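To try the pattern without a Spark session, here is a minimal plain-Python sketch run against the two sample strings from the question. The names EDGE_NON_WORD and strip_edges are hypothetical, chosen only for this illustration. In Python 3, str patterns are already Unicode-aware, so the re.UNICODE flag is implied and is left out here.

```python
import re

# For str patterns in Python 3, \W is Unicode-aware by default, so accented
# letters such as "Á" count as word characters and are left in place.
EDGE_NON_WORD = re.compile(r"^\W+|\W+$")


def strip_edges(text):
    """Remove runs of non-word characters from both ends of text."""
    return EDGE_NON_WORD.sub("", text)


samples = [
    " .. -- Água, 1234 ...",
    " - -- https://www.example.com/page.html *****",
]
for sample in samples:
    print(repr(strip_edges(sample)))
# 'Água, 1234'
# 'https://www.example.com/page.html'
```

Because the underscore is a word character, leading or trailing underscores are not stripped; a string made up only of non-word characters comes back empty.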

--EDIT--

I tried using:

(?ui)^\W+|\W+$

with PySpark's regexp_replace function. It didn't work, so I'm sticking with the re-based UDF solution.
