Regex: Get rid of consecutive punctuation

Question

I was trying to clean the words in a list using the following code:

import re

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# Keep only the words that the regex matches
def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))

items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))

I need only words longer than one letter, without punctuation. But I have a problem in the following cases:

what I have:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[заметки"", дмитрием, крыловым"]

what I need:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[непутевые, заметки, дмитрием, крыловым]

If you make it a Python rather than a PySpark problem, more people will be able to help you. Just a suggestion, as it seems to be only about regex. Good job on sharing the input. It's not clear whether the output is what you are getting or what you are expecting. It's a good idea to include both, ideally as a hard-coded failing test case in your program. – Allan Wind, Sep 25 '21 at 06:31
@AllanWind Thanks. I have adjusted the question according to your advice ;) – Kirill Volkov, Sep 25 '21 at 06:45
Well, it seems all you need is to replace `if regex.match(i)` with `if regex.fullmatch(i)`, if you have a list of words in `words_list`. Otherwise, use `def clear_list(words_list): return re.findall(r'\b\w{2,}\b', " ".join(words_list))` – Wiktor Stribiżew, Sep 25 '21 at 09:56
@WiktorStribiżew Cool. The second option works, thank you. – Kirill Volkov, Sep 25 '21 at 16:06
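
To make the suggestion in the comments concrete, here is a small plain-Python sketch (no Spark needed) that compares `match`, `fullmatch` and `findall` on the sample from the question. The split of the sample row into five strings is an assumption about how the list looks after tokenization, and the variable names are only illustrative.

```python
import re

# Sample row from the question, assumed to be tokenized into these five strings
words_list = ['"""непутевые', 'заметки""', 'с', 'дмитрием', 'крыловым"']

regex = re.compile(r'\w{2,}')

# match() only anchors at the start of the string, so words with trailing
# quotes are kept unchanged and words with leading quotes are dropped
with_match = [w for w in words_list if regex.match(w)]

# fullmatch() requires the whole string to consist of word characters
with_fullmatch = [w for w in words_list if regex.fullmatch(w)]

# findall() on the joined text extracts the words themselves
with_findall = re.findall(r'\b\w{2,}\b', ' '.join(words_list))

print(with_match)
print(with_fullmatch)
print(with_findall)
```

On this input only `findall` produces the list the question asks for. `fullmatch` is the better fit when the items are already clean words and only the length filter is needed.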

score 1 · Answer 1 · answered Sep 25 '21 at 07:31


You can use `regexp_replace` and then filter the DataFrame to get the result in PySpark itself.

Avoid UDFs where possible: a UDF is a black box to Spark, so the optimizer cannot apply its optimizations to it efficiently.

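from pyspark.sql.functions import regexp_replace, col, length

# Strip everything that is not a letter or digit; \p{L} and \p{N} also cover non-Latin scripts such as Cyrillic
df = df.withColumn("col_name", regexp_replace(col("col_name"), r"[^\p{L}\p{N}]", ""))
# Keep only the values that still have at least two characters
df = df.where(length(col("col_name")) >= 2)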


Drashti Dobariya


Thank you for the very useful explanation. I didn't know that. Will take it into account. – Kirill Volkov Sep 25 '21 at 16:11

score 1 · Accepted Answer · answered Sep 25 '21 at 07:46

Replace this line:

filtered = [i for i in words_list if regex.match(i)]

With this line:

filtered = [regex.search(i).group() for i in words_list if regex.search(i)]

The regular expression is fine, but the list comprehension returns the original item, not the matched substring. Code sample:

import re

regex = re.compile(r'[\w\d]{2,}', re.U)
words_list = ['""word', 'wor"', 'c', "test"]
filtered = [regex.search(i).group() for i in words_list if regex.search(i)]
print(filtered)
# prints: ['word', 'wor', 'test']
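
Applied to the sample from the question, the same idea might look like the sketch below. It assumes Python 3.8+ (for the `:=` operator, used here so that each word is searched only once) and assumes the sample row is tokenized into five strings; neither detail comes from the answer itself, and `WORD_RE` is just an illustrative name.

```python
import re

# \d is already covered by \w, so \w{2,} is equivalent to [\w\d]{2,}
WORD_RE = re.compile(r'\w{2,}')

def clear_list(words_list):
    # Keep the matched substring rather than the original item
    return [m.group() for w in words_list if (m := WORD_RE.search(w))]

# Sample row from the question, assumed to be tokenized like this
sample = ['"""непутевые', 'заметки""', 'с', 'дмитрием', 'крыловым"']
cleaned = clear_list(sample)
print(cleaned)
```

Note that `search` returns only the first run of two or more word characters in each item, so an item such as `foo-bar` would come back as `foo`. If every run should be kept, `findall` on each item is the safer choice.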
