I have a PySpark DataFrame with many columns, including an array column and a string column:
numbers <Array> | name<String>
------------------------------|----------------
["160001","160021"] | A
------------------------------|----------------
["160001","1600", "42345"] | B
------------------------------|----------------
["160001","9867", "42345"] | C
------------------------------|----------------
["160001","8650", "2345"] | A
------------------------------|----------------
["2456","78568", "42345"] | B
-----------------------------------------------
I want to remove the 4-digit numbers from the numbers column when the name column is not "B", and keep them when the name column is "B".
For example:
In rows 2 and 5, "1600" and "2456" have 4 digits, but the name column is "B", so they should stay in the array:
------------------------------|----------------
["160001","1600", "42345"] | B
------------------------------|----------------
["2456","78568", "42345"] | B
-----------------------------------------------
In rows 3 and 4, the numbers column contains 4-digit numbers and the name column is not "B", so those numbers should be removed.
The rows in question:
------------------------------|----------------
["160001","9867", "42345"] | C
------------------------------|----------------
["160001","8650", "2345"] | A
------------------------------|----------------
Expected result:
numbers <Array> | name<String>
------------------------------|----------------
["160001","160021"] | A
------------------------------|----------------
["160001","1600", "42345"] | B
------------------------------|----------------
["160001", "42345"] | C
------------------------------|----------------
["160001"] | A
------------------------------|----------------
["2456","78568", "42345"] | B
-----------------------------------------------
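To make the rule precise, here is a minimal sketch of the per-row logic in plain Python. It is not PySpark: `filter_numbers` is a hypothetical helper name, and it assumes the array elements are strings, so "4 digits" is checked as a string length of 4.

```python
def filter_numbers(numbers, name):
    """Drop 4-character entries unless name is "B" (hypothetical helper)."""
    if name == "B":
        # Rows named "B" keep every element, including the 4-digit ones.
        return list(numbers)
    # Any other name: keep only the elements that are not 4 characters long.
    return [n for n in numbers if len(n) != 4]


# The sample rows from the table above, as (numbers, name) pairs.
rows = [
    (["160001", "160021"], "A"),
    (["160001", "1600", "42345"], "B"),
    (["160001", "9867", "42345"], "C"),
    (["160001", "8650", "2345"], "A"),
    (["2456", "78568", "42345"], "B"),
]

result = [(filter_numbers(nums, name), name) for nums, name in rows]
for nums, name in result:
    print(nums, "|", name)
```

A column-level equivalent would presumably combine a condition on name with an array filter on element length.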
How can I do this? Thank you.