
I have a Spark DataFrame which contains the content of a JSON file. I need to create a new column which is populated conditionally based on the content of another column.

Let's say I have a column containing some numbers, and my new column will be populated depending on the value of these numbers (e.g. if the first column has a number lower than 5, my new column will be populated with the string 'lower than five'; if the value is greater than 5, the new column will be populated with 'greater than five').

I know that I can do something like this with the when function:

file.withColumn('newcolumn', \
                F.when(file.oldColumn < 5, 'Lower than five') \
                .when(file.oldColumn > 5, 'Greater than five')).show()

but what if 'oldColumn' does not contain just integers, but strings from which I need to extract the integer:

e.g. 'PT5M', from which I need to extract the 5; and I need to treat a string like 'PTM', which does not contain a number, as 0.

So far I have managed to extract the number from my first column using regexp_extract, but I am struggling with turning the null values into 0.

Example, where 1 is the original column and 2 is the new column:

+-------+-------------------+
|1      |  2                |
+-------+-------------------+
|PT5M   |  Lower than five  |   
|PT10M  |  Greater than five|    
|PT11M  |  Greater than five|        
+-------+-------------------+

Thanks for your help!

user2697881
  • how many numeric characters can the string have? will they always be consecutive? – Vamsi Prabhala Jan 24 '19 at 16:37
  • Please share a small sample DataFrame and describe your desired output. [How to create good reproducible DataFrame Examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Jan 24 '19 at 16:42
  • I have updated the description of the problem with an example. @VamsiPrabhala the integers in the string will always come before a standard character, so they are easily extractable with a regex. – user2697881 Jan 24 '19 at 16:54

3 Answers


Use regexp_replace to replace non-numeric characters with an empty string, then use when to set the column value:

file.withColumn('newcolumn', \
                F.when(F.regexp_replace(file.oldColumn, '[^0-9]', '') == '', 'Lower than five') \
                 .when(F.regexp_replace(file.oldColumn, '[^0-9]', '').cast('int') < 5, 'Lower than five') \
                 .otherwise('Greater than five')).show()
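
A self-contained sketch of the same approach; the SparkSession setup and the sample rows (including a 'PTM' row with no digits) are illustrative assumptions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question, plus a row with no digits at all
file = spark.createDataFrame([('PT5M',), ('PT10M',), ('PT11M',), ('PTM',)], ['oldColumn'])

# Strip everything that is not a digit; an empty result means "no number", treated as 0
digits = F.regexp_replace(file.oldColumn, '[^0-9]', '')

file.withColumn('newcolumn',
                F.when(digits == '', 'Lower than five')
                 .when(digits.cast('int') < 5, 'Lower than five')
                 .otherwise('Greater than five')).show()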
Vamsi Prabhala
  • how would this work if the first column does not have any integers to be extracted? ideally that has to be treated as 0 – user2697881 Jan 24 '19 at 17:04
  • and how would this change if my first column would be in the format PT1H15M with every number before an 'H' is the representation of the amount of hours and every number before the 'M' the representation of minutes. What would be a clever way to extract these information considering that I might want to normalize all the numbers to minutes? – user2697881 Jan 24 '19 at 17:46
  • Please post a new question with the additional requirement. – Vamsi Prabhala Jan 24 '19 at 19:45
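
The 'PT1H15M' follow-up in the comments was deferred to a new question, but one possible sketch for normalizing such values to minutes could look like this; the sample rows and regex patterns are illustrative assumptions, not part of the answer above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('PT1H15M',), ('PT5M',), ('PT2H',)], ['a'])

# regexp_extract returns '' when the pattern does not match; casting '' to int
# yields null, so coalesce(..., 0) treats a missing hours/minutes part as 0
hours = F.coalesce(F.regexp_extract('a', r'(\d+)H', 1).cast('int'), F.lit(0))
minutes = F.coalesce(F.regexp_extract('a', r'(\d+)M', 1).cast('int'), F.lit(0))

df.withColumn('total_minutes', hours * 60 + minutes).show()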

There are many ways:

scala> val df = Seq("PT5M","PT10M","PT11M").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.show(false)
+-----+
|a    |
+-----+
|PT5M |
|PT10M|
|PT11M|
+-----+

scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).show(false)
+-----+---+
|a    |b  |
+-----+---+
|PT5M |5  |
|PT10M|10 |
|PT11M|11 |
+-----+---+


scala> df.withColumn("b",regexp_extract('a,"""\D*(\d+)\D*""",1)).withColumn("c", when('b.cast("int") < 5, "Lower than five").when('b.cast("int") > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a    |b  |c                |
+-----+---+-----------------+
|PT5M |5  |null             |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
+-----+---+-----------------+



If there are no digits in the value and you want to default it to 0, then you could use coalesce()

scala> val df = Seq("PT5M","PT10M","PT11M", "XXM").toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]

scala> df.show
+-----+
|    a|
+-----+
| PT5M|
|PT10M|
|PT11M|
|  XXM|
+-----+


scala> df.withColumn("b",coalesce(regexp_extract('a,"""\D*(\d+)\D*""",1).cast("int"),lit(0))).withColumn("c", when('b < 5, "Lower than five").when('b > 5, "Greater than five").otherwise("null")).show(false)
+-----+---+-----------------+
|a    |b  |c                |
+-----+---+-----------------+
|PT5M |5  |null             |
|PT10M|10 |Greater than five|
|PT11M|11 |Greater than five|
|XXM  |0  |Lower than five  |
+-----+---+-----------------+


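Since the question itself uses PySpark, a rough Python equivalent of the coalesce() approach could look like this; the spark session and sample rows are illustrative assumptions:

from pyspark.sql import functions as F

df = spark.createDataFrame([('PT5M',), ('PT10M',), ('PT11M',), ('XXM',)], ['a'])

# Extract the digits, cast to int (null when nothing matched), default to 0
b = F.coalesce(F.regexp_extract('a', r'\D*(\d+)\D*', 1).cast('int'), F.lit(0))

df.withColumn('b', b) \
  .withColumn('c', F.when(b < 5, 'Lower than five')
                    .when(b > 5, 'Greater than five')
                    .otherwise('null')).show()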
stack0114106
  • trying to do the same thing but coalesce strings, not sure why it's not working: df.withColumn("b",coalesce(regexp_extract('a,"""\D*(\d+)\D*""",1).cast(StringType), lit("1.0"))).show() – Jelly Wu Apr 16 '20 at 17:28
Using regexp_extract with when, and treating an empty match as 'Lower than five':

from pyspark.sql.functions import regexp_extract, when
myValues = [('PT5M',),('PT10M',),('PT11M',),('PT',)]
df = sqlContext.createDataFrame(myValues,['1'])
df.show()
+-----+
|    1|
+-----+
| PT5M|
|PT10M|
|PT11M|
|   PT|
+-----+

df = df.withColumn('interim', regexp_extract(df['1'], r'\d+', 0))
df = df.withColumn('2', when(df['interim'] < 5, 'Lower than five')
                        .when(df['interim'] > 5, 'Greater than five')
                        .when(df['interim'] == '', 'Lower than five')).drop('interim')
df.show()
+-----+-----------------+
|    1|                2|
+-----+-----------------+
| PT5M|             null|
|PT10M|Greater than five|
|PT11M|Greater than five|
|   PT|  Lower than five|
+-----+-----------------+
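
Note that with strict < and > comparisons the value 5 matches neither branch, so 'PT5M' comes out as null in the output above. If 5 itself should count as 'Lower than five', as in the question's example table, an inclusive bound covers it. A sketch of that variant, replacing the second withColumn above (Spark implicitly casts the extracted string for the numeric comparison, as in the original):

df = df.withColumn('2', when(df['interim'] == '', 'Lower than five')
                        .when(df['interim'] <= 5, 'Lower than five')
                        .otherwise('Greater than five')).drop('interim')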
cph_sto