
I have a column in a data frame named "tags". I need to extract the values that satisfy a condition: the value must not contain an underscore (_) or a colon (:).

For example:

"tags": "hai, hello, amount_10, amount_90, total:100"

Expected result:

"new_column" : "hai, hello"

For your information:

I extracted all the amount tags with:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# For each row, find all amount_<value> tags and keep only the value part
collectAmount = udf(lambda s: [amount.split('_')[1] for amount in re.findall(r'amount_\w+', s)],
                    ArrayType(StringType()))

productsDF = productsDF.withColumn('amount_tag', collectAmount('tags'))
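For reference, the extraction inside that udf can be sanity-checked in plain Python before registering it (a sketch using the sample string from above):

```python
import re

# Sample tags string from the question
tags = "hai, hello, amount_10, amount_90, total:100"

# Same logic as the udf body: find amount_<value> tags and keep the value part
amounts = [t.split('_')[1] for t in re.findall(r'amount_\w+', tags)]
print(amounts)  # ['10', '90']
```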
Jan

3 Answers


Try this:

df.withColumn('new_column', expr('''
    concat_ws(',',
        array_remove(
            transform(split(tags, ','),
                      x -> regexp_extract(x, '^(?!.*_)(?!.*:).+$', 0)),
            ''))
''')).show(2, False)

+-------------------------------------------+----------+
|tags                                       |new_column|
+-------------------------------------------+----------+
|hai, hello, amount_10, amount_90, total:100|hai, hello|
|hai, hello, amount_10, amount_90, total:100|hai, hello|
+-------------------------------------------+----------+
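To see what the regexp_extract step does per element, the same pattern can be exercised in plain Python (a sketch; the two lookaheads reject any token containing _ or :, and the empty strings that regexp_extract returns on non-matches are what array_remove strips out):

```python
import re

# Same pattern as in the expr: match only tokens with no '_' and no ':'
rx = re.compile(r'^(?!.*_)(?!.*:).+$')
tokens = "hai, hello, amount_10, amount_90, total:100".split(',')

# Keep only the tokens the pattern matches, then rejoin like concat_ws
kept = [t for t in tokens if rx.match(t)]
print(','.join(kept))  # hai, hello
```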
Shubham Jain

No regex needed really:

tags = ["hai", "hello", "amount_10", "amount_90", "total:100"]

new_column = [tag for tag in tags if not any(junk in tag for junk in ["_", ":"])]
print(new_column)

If you insist on using regular expressions:

import re
rx = re.compile(r'^(?!.*_)(?!.*:).+$')
new_column = [tag for tag in tags if rx.match(tag)]
print(new_column)

See a demo on regex101.com.
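To apply this to the DataFrame column, the list comprehension can be wrapped in a helper and registered as a udf (a sketch; `clean_tags` is a hypothetical name, and the commented-out registration assumes the pyspark imports from the question):

```python
def clean_tags(s):
    """Keep only the comma-separated tags without '_' or ':'."""
    tags = [t.strip() for t in s.split(',')]
    return [t for t in tags if not any(junk in t for junk in ['_', ':'])]

print(clean_tags("hai, hello, amount_10, amount_90, total:100"))  # ['hai', 'hello']

# With Spark, register it as a udf returning ArrayType(StringType()):
# collectClean = udf(clean_tags, ArrayType(StringType()))
# productsDF = productsDF.withColumn('new_column', collectClean('tags'))
```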

Jan

You can use a regex as per the answer above, but you will need to either wrap it in a udf or, as I show below, use the PySpark built-ins:

from pyspark.sql import functions as F

# Flag rows whose tags contain an underscore or a colon anywhere
df = df.withColumn("extracted", F.regexp_extract("tags", "[_:]", 0))
df.filter(df["extracted"] == '').select("tags").show()
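Note that this filters whole rows: regexp_extract returns the first match of [_:] anywhere in tags, so only rows with no underscore or colon at all survive. In plain Python terms (a sketch with made-up sample rows):

```python
import re

# Hypothetical sample rows: one with flagged characters, one without
rows = ["hai, hello, amount_10, amount_90, total:100",
        "hai, hello"]

# Equivalent of keeping rows where regexp_extract(tags, '[_:]', 0) == ''
kept = [r for r in rows if re.search('[_:]', r) is None]
print(kept)  # ['hai, hello']
```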
ags29