
I want to rewrite the below for loop, written in R, into PySpark.

for (i in unique(fix_map[!is.na(area)][order(area), area])) {
  # select all contact records from the currently processed area, and also those without area assigned
  m_f_0 <- unique(con_melt[area == i | area == "Unknown"])
}

  1. con_melt also has "Unknown" as a value in the area column.

  2. So I want to select the records that are present in both fix_map and con_melt based on "area", AND the con_melt records whose area value is "Unknown".

I tried using a join in PySpark, but then I lose the "Unknown" records (see the sketch after the sample data below).

Please suggest how to handle this.

fix_map:

       id        value area type
1: 227149 385911000059  510  mob
2: 122270 385911000661  110  fix

con_melt:

       id    area type
1: 227149     510  mob
2: 122270     100  fix
3: 122350 Unknown  fix

Output should be:

          value    area type
1: 385911000059     510  mob
2:       122350 Unknown  fix
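
For reference, a minimal sketch of the inner join described above that drops the "Unknown" row (assuming DataFrames named fix_map and con_melt as in the sample data above):

from pyspark.sql import functions as psf

# a plain inner join on area keeps only areas present in BOTH
# dataframes, so con_melt's 'Unknown' row has no match and is dropped
joined = fix_map.alias("a").join(
    con_melt.alias("b"),
    psf.col("a.area") == psf.col("b.area"),
)
joined.select("a.id", "a.value", "a.area", "a.type").show()
# only the area 510 record survives; 122350/Unknown is lost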

Tilo
  • IMHO - it's better to add sample data with your requirement and mention what issue you are facing; people working on PySpark might not be well versed with R. This post can help you create a reproducible Spark example: https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples/48427186#48427186 – Shantanu Sharma May 21 '19 at 14:16

1 Answer


Try this -

I kept the join, filter, and union in separate dataframes for easy explanation; these could be combined (see the combined sketch after the output below).

from pyspark.sql import functions as psf

# match records on the area column
join_condition = [psf.col('a.area') == psf.col('b.area')]

# records present in both fix_map and con_melt, matched on area
df1 = fix_map.alias("a").join(con_melt.alias("b"), join_condition).select('a.id', 'a.area', 'a.type')

# con_melt records with no area assigned
df2 = con_melt.filter("area == 'Unknown'").select('id', 'area', 'type')

# stack both result sets
df1.union(df2).show()

#+------+-------+----+
#|    id|   area|type|
#+------+-------+----+
#|227149|    510| mob|
#|122350|Unknown| fix|
#+------+-------+----+
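
For reference, a sketch of the same logic combined into one chain (it should produce the same result as the step-by-step version above):

result = (
    fix_map.alias("a")
    .join(con_melt.alias("b"), psf.col("a.area") == psf.col("b.area"))
    .select("a.id", "a.area", "a.type")
    .union(con_melt.filter(psf.col("area") == "Unknown").select("id", "area", "type"))
)
result.show()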

I have considered area as StringType since it contains 'Unknown'.
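
For completeness, a sketch of how the sample DataFrames could be created with area as a string column (values taken from the question; the SparkSession setup is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# area is a string column so it can hold 'Unknown'
fix_map = spark.createDataFrame(
    [(227149, "385911000059", "510", "mob"),
     (122270, "385911000661", "110", "fix")],
    ["id", "value", "area", "type"],
)

con_melt = spark.createDataFrame(
    [(227149, "510", "mob"),
     (122270, "100", "fix"),
     (122350, "Unknown", "fix")],
    ["id", "area", "type"],
)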

Shantanu Sharma