
I have a PySpark DataFrame that contains many columns; among them are an array column and a string column:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

I want to remove the numbers that contain 4 digits from the numbers column when the name column is not "B", and keep them when the name column is "B". For example:

In rows 2 and 5, "1600" and "2456" contain 4 digits and the name column is "B", so I should keep them in the column values:

------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

In rows 3 and 4, the numbers column contains 4-digit numbers but the name column is not "B", so I should remove them.

Example:

------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------

Expected result:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001", "42345"]           |     C
------------------------------|----------------
["160001"]                    |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

How can I do it? Thank you.

verojoucla

2 Answers


Since Spark 2.4 you can use the higher-order function FILTER to filter the array. Combining it with an if expression should solve the problem:

df.selectExpr("if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers", "name")
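
For example, here is a minimal sketch reproducing the sample data from the question (the SparkSession setup and variable names are assumed for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [(["160001", "160021"], "A"),
     (["160001", "1600", "42345"], "B"),
     (["160001", "9867", "42345"], "C"),
     (["160001", "8650", "2345"], "A"),
     (["2456", "78568", "42345"], "B")],
    ["numbers", "name"],
)

# filter out 4-digit elements only for rows where name != 'B'
df.selectExpr(
    "if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers",
    "name",
).show(truncate=False)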
David Vrba

You need to write a UDF to do the array filtering, and use it with a when clause so that the UDF is only applied on the right condition, i.e. where name != "B":

from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType

# keep only the elements that do not have exactly 4 characters
filter_array_udf = udf(lambda arr: [x for x in arr if len(x) != 4], "array<string>")
# if the string schema is not supported, use the next commented line instead
# filter_array_udf = udf(lambda arr: [x for x in arr if len(x) != 4], ArrayType(StringType()))

df = df.withColumn(
    "numbers",
    when(col("name") != "B", filter_array_udf(col("numbers"))).otherwise(col("numbers")),
)
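
Applied to the sample data, this should give the expected result (output below is sketched from the question's expected table, not from a live run):

df.show(truncate=False)
# +---------------------+----+
# |numbers              |name|
# +---------------------+----+
# |[160001, 160021]     |A   |
# |[160001, 1600, 42345]|B   |
# |[160001, 42345]      |C   |
# |[160001]             |A   |
# |[2456, 78568, 42345] |B   |
# +---------------------+----+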
pissall
  • Thank you for your answer. I got this error: IllegalArgumentException: u'otherwise() can only be applied on a Column previously generated by when()' – verojoucla Nov 08 '19 at 08:11
  • @verojoucla I have fixed the error; it was due to a bracket mismatch. Thanks :) Please accept and upvote if it helped – pissall Nov 08 '19 at 08:24
  • @verojoucla Hahahaha I was trying to solve it before seeing your comment here – pissall Nov 13 '19 at 12:31
  • Hi @pissall :) Do you have an idea about this issue, please? Thank you https://stackoverflow.com/questions/59197109/looking-if-string-contain-a-sub-string-in-differents-dataframes/59197875#59197875 – verojoucla Dec 05 '19 at 15:31
  • The second solution works, but only on a small amount of data; when I apply it to the whole dataset, it doesn't work – verojoucla Dec 05 '19 at 15:56
  • After testing, I noticed the join does not work when I run the code on the whole dataset. – verojoucla Dec 05 '19 at 16:18