pyspark left outer join - can't get blended result

Question

I want to join 2 pyspark dfs, where df_template has all the columns and rows that I need in the output and df_proc has data in it for some (but not all) of the row/column combinations in df_template. The code I'm using is:

df_blend = df_template.join(df_proc, ["metro_area"],"left").select(df_template["*"])

But all this returns is the original df_template:

+----------+-----------+-----------+-----------+-----------+-----------+-----------+
|metro_area| option_001| option_002| option_003| option_004| option_005| option_006|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+
| A10000501|       null|       null|       null|       null|       null|       null|
| A10000502|       null|       null|       null|       null|       null|       null|
| A10000503|       null|       null|       null|       null|       null|       null|
| A10000504|       null|       null|       null|       null|       null|       null|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+

This seems pretty basic but I just can't figure out how to get my desired result, any suggestions ??? Here is what I want the output to look like ...

+----------+-----------+-----------+-----------+-----------+-----------+-----------+
|metro_area| option_001| option_002| option_003| option_004| option_005| option_006|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+
| A10000501|       1455|         26|         19|         65|         38|       null|
| A10000502|        654|       1876|       1950|        886|       null|       null|
| A10000503|       null|       null|       null|       null|       null|       null|
| A10000504|        774|        854|       1012|        271|       null|       null|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+

For reference, here are the original dataframes. df_template

+----------+-----------+-----------+-----------+-----------+-----------+-----------+
|metro_area| option_001| option_002| option_003| option_004| option_005| option_006|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+
| A10000501|       null|       null|       null|       null|       null|       null|
| A10000502|       null|       null|       null|       null|       null|       null|
| A10000503|       null|       null|       null|       null|       null|       null|
| A10000504|       null|       null|       null|       null|       null|       null|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+

df_proc

+----------+-----------+-----------+-----------+-----------+-----------+-----------+
|metro_area| option_001| option_002| option_003| option_004| option_005| option_006|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+
| A10000502|        654|       1876|       1950|        886|       null|       null|
| A10000504|        774|        854|       1012|        271|       null|       null|
| Al0000501|       1455|         26|         19|         65|         38|       null|
+----------+-----------+-----------+-----------+-----------+-----------+-----------+

Few stuff, you need to remove the select at the end. ` df_blend = df_template.join(df_proc, ["metro_area"],"left") ` . This should give you the expected output. But i see your df_template has all columns as null other than join key, by which this will also put up duplicate columns as wel from df_template dataframe . Hence when you run select of it , it is showing df_template. if df_template contains non-null values then removing select option from your above query is correct. — Usman Azhar, Jan 14 '19 at 00:20
Thanks Usman. When I remove the select, I get the columns duplicated (excepting metro area). I am trying to using the df_template to ensure that all of the columns are in the final result, regardless of whether there is data for them in df_proc. I need all the columns there because that is how the table is defined in Athena (df_blend is eventually written to S3) — Thom Rogers, Jan 14 '19 at 03:09
I can't accept because the question has not been answered: How do I join df_template to df_proc so that I have no duplicate columns AND the data in df_proc appears in the final df_blend? — Thom Rogers, Jan 14 '19 at 05:17
Got it, but you will get duplicate columns if all the columns are set to null value, if any of the columns has a value ,i.e any of the column option_*** has a value in it, the above query should work. I am not sure, why its unable to join it when columns are having null value. — Usman Azhar, Jan 14 '19 at 15:50
you can also refer this solution to fix the duplicates... https://stackoverflow.com/questions/44372413/how-to-remove-duplicate-columns-with-their-non-null-value-after-join — Usman Azhar, Jan 14 '19 at 19:05

pyspark left outer join - can't get blended result

0 Answers0