
I have a column in a data frame named "tags". I need to extract the values that satisfy a condition: the value must not contain an underscore (_) or a colon (:).

For example:

"tags": "hai, hello, amount_10, amount_90, total:100"

Expected result:

"new_column" : "hai, hello"

For your information:

I extracted all the amount tags with:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# For each row, find all amount_<value> tags and keep only the value part
collectAmount = udf(lambda s: [amount.split('_')[1] for amount in re.findall(r'amount_\w+', s)],
                    ArrayType(StringType()))

productsDF = productsDF.withColumn('amount_tag', collectAmount('tags'))
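For reference, the extraction inside that udf can be sanity-checked in plain Python before registering it (a sketch using the sample string from above):

```python
import re

# Sample tags string from the question
tags = "hai, hello, amount_10, amount_90, total:100"

# Same logic as the udf body: find amount_<value> tags and keep the value part
amounts = [t.split('_')[1] for t in re.findall(r'amount_\w+', tags)]
print(amounts)  # ['10', '90']
```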
Jan

3 Answers


Try this:

df.withColumn('new_column', expr('''
    concat_ws(',',
        array_remove(
            transform(split(tags, ','),
                      x -> regexp_extract(x, '^(?!.*_)(?!.*:).+$', 0)),
            ''))
''')).show(2, False)

+-------------------------------------------+----------+
|tags                                       |new_column|
+-------------------------------------------+----------+
|hai, hello, amount_10, amount_90, total:100|hai, hello|
|hai, hello, amount_10, amount_90, total:100|hai, hello|
+-------------------------------------------+----------+
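To see what the regexp_extract step does per element, the same pattern can be exercised in plain Python (a sketch; the two lookaheads reject any token containing _ or :, and the empty strings that regexp_extract returns on non-matches are what array_remove strips out):

```python
import re

# Same pattern as in the expr: match only tokens with no '_' and no ':'
rx = re.compile(r'^(?!.*_)(?!.*:).+$')
tokens = "hai, hello, amount_10, amount_90, total:100".split(',')

# Keep only the tokens the pattern matches, then rejoin like concat_ws
kept = [t for t in tokens if rx.match(t)]
print(','.join(kept))  # hai, hello
```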
Shubham Jain

No regex needed really:

tags = ["hai", "hello", "amount_10", "amount_90", "total:100"]

new_column = [tag for tag in tags if not any(junk in tag for junk in ["_", ":"])]
print(new_column)

If you insist on using regular expressions:

import re
rx = re.compile(r'^(?!.*_)(?!.*:).+$')
new_column = [tag for tag in tags if rx.match(tag)]
print(new_column)

See a demo on regex101.com.
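To apply this to the DataFrame column, the list comprehension can be wrapped in a helper and registered as a udf (a sketch; `clean_tags` is a hypothetical name, and the commented-out registration assumes the pyspark imports from the question):

```python
def clean_tags(s):
    """Keep only the comma-separated tags without '_' or ':'."""
    tags = [t.strip() for t in s.split(',')]
    return [t for t in tags if not any(junk in t for junk in ['_', ':'])]

print(clean_tags("hai, hello, amount_10, amount_90, total:100"))  # ['hai', 'hello']

# With Spark, register it as a udf returning ArrayType(StringType()):
# collectClean = udf(clean_tags, ArrayType(StringType()))
# productsDF = productsDF.withColumn('new_column', collectClean('tags'))
```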

Jan

You can use a regex as per the answer above, but you will need to either wrap it in a udf or, as I show below, use the PySpark built-ins:

from pyspark.sql import functions as F

# Flag rows whose tags contain an underscore or a colon anywhere
df = df.withColumn("extracted", F.regexp_extract("tags", "[_:]", 0))
df.filter(df["extracted"] == '').select("tags").show()
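Note that this filters whole rows: regexp_extract returns the first match of [_:] anywhere in tags, so only rows with no underscore or colon at all survive. In plain Python terms (a sketch with made-up sample rows):

```python
import re

# Hypothetical sample rows: one with flagged characters, one without
rows = ["hai, hello, amount_10, amount_90, total:100",
        "hai, hello"]

# Equivalent of keeping rows where regexp_extract(tags, '[_:]', 0) == ''
kept = [r for r in rows if re.search('[_:]', r) is None]
print(kept)  # ['hai, hello']
```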
ags29