I am trying to clean the words in an array column using the following code:
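import re
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# Define a function to clean a list of words
def clear_list(words_list):
    # Keep only items that start with at least two word characters
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered

# Wrap the function as a UDF that returns an array of strings
clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))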
I need only the words longer than one letter, with the punctuation stripped. However, I get the wrong result in cases like the following:
What I get:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[заметки"", дмитрием, крыловым"]
What I need:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[непутевые, заметки, дмитрием, крыловым]
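For reference, here is a minimal plain-Python sketch of the behaviour I am after, without Spark. It rests on two assumptions: that the array elements are the raw tokens with the quote characters still attached (the sample list below is my guess at what the column actually holds), and that extracting the word characters with `findall` is acceptable instead of filtering whole items with `match`. As far as I can tell, `match` only tests the start of each item, so an item with a leading quote is dropped and an item with a trailing quote is kept unchanged.

```python
import re

# Runs of two or more word characters; in Python 3, \w already covers
# Unicode letters, digits and the underscore.
WORD_RE = re.compile(r"\w{2,}", re.UNICODE)

def clear_list(words_list):
    """Return every run of 2+ word characters found in the input items."""
    cleaned = []
    for token in words_list or []:  # treat a null array as empty
        # findall extracts the matching text itself, so surrounding
        # punctuation is dropped instead of being kept with the item.
        cleaned.extend(WORD_RE.findall(token))
    return cleaned

# Assumed raw content of the "words" array for the example row
sample = ['"непутевые', 'заметки"', 'с', 'дмитрием', 'крыловым"']
print(clear_list(sample))
```

Note that with this approach a hyphenated item such as "ab-cd" would be split into two words, which may or may not be desirable. The function should be usable with the same `sf.udf(clear_list, ArrayType(StringType()))` wrapper as above, though I have not verified that on a cluster.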