
Let's say we join two DataFrames in PySpark; each one has its own alias and they have the same columns:

    joined_df = source_df.alias("source").join(target_df.alias("target"),
                    (col("source.A_column") == col("target.A_column")), 'outer')
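(For reference, the two DataFrames could have been built like this; the sample data is made up only so the example is self-contained:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    # hypothetical sample data, only to make the example reproducible
    source_df = spark.createDataFrame([(1, "x", "y")], ["A_column", "B_column", "C_column"])
    target_df = spark.createDataFrame([(1, "p", "q")], ["A_column", "B_column", "C_column"])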

How can I get a list of the column names of the joined_df DataFrame, with the aliases from the original DataFrames, something like:

[source.A_column, target.A_column, source.B_column, target.B_column, source.C_column, target.C_column]

This list shows up in the message of an AnalysisException, so the information is obviously stored somewhere, but I haven't found a way to show it without triggering the exception...

What I tried:

  1. Get the names as above with some direct method or property, but there is no such thing as a df.columns_with_alias property.
  2. Get the list of columns from the DataFrame as instances of the Column class (because this class keeps the alias info), but df.columns just gives you strings (see the sketch below)... and I found no other way.
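For example, as far as I can tell, both of these only return the plain names without any alias:

    print(joined_df.columns)
    # ['A_column', 'B_column', 'C_column', 'A_column', 'B_column', 'C_column']
    print([field.name for field in joined_df.schema.fields])
    # the same plain strings; the alias is not part of the schema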

Is there any way to show these column names?

Lohi
  • Does this answer your question? [How to resolve duplicate column names while joining two dataframes in PySpark?](https://stackoverflow.com/questions/55104036/how-to-resolve-duplicate-column-names-while-joining-two-dataframes-in-pyspark), it basically says there is no way except to rename all columns to have your prefix before joining. – Kashyap Mar 20 '23 at 22:27

2 Answers


I found only a really ugly way, which is usable for debugging purposes only, definitely not for production code:

    from pyspark.sql import DataFrame
    from pyspark.sql.utils import AnalysisException
    import re

    def show_column_names_with_alias(df: DataFrame):
        try:
            # select a column that cannot exist to force an AnalysisException
            df.select("this_column_name_is_not_in_the_dataframe")
        except AnalysisException as e:
            # the exception message lists all columns (with their aliases) in square brackets
            print(re.search(r"\[(.*)\]", e.desc).group(1))

Selecting a column that is not in the DataFrame raises an AnalysisException whose message contains the column names with their aliases; I just parse the message and print the matched part... It works, but there has to be a better way. The output is:

source.A_column, target.A_column, source.B_column, target.B_column, source.C_column, target.C_column
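For completeness, the call that produced that output would simply be (assuming the joined_df from the question):

    show_column_names_with_alias(joined_df)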
Lohi

Use a function to generate the new column names:

    def add_alias(columns, alias):
        return [f"{alias}_{column}" for column in columns]
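For example, with hypothetical column names id and result:

    add_alias(["id", "result"], "source")
    # ['source_id', 'source_result']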

Case 1: If you have access to the source and target DataFrames

  1. Use the toDF() function to rename all the columns:

    source_df = source_df.toDF(*add_alias(source_df.columns, "source"))
    target_df = target_df.toDF(*add_alias(target_df.columns, "target"))

  2. Join using the new column names:

    source_df.join(target_df, col("source_id") == col("target_id"), "outer").show()

Output:

+---------+-------------+---------+-------------+
|source_id|source_result|target_id|target_result|
+---------+-------------+---------+-------------+
|        a|           35|        a|           35|
|        b|           20|        b|           20|
+---------+-------------+---------+-------------+
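(For reference, the output above assumes sample DataFrames along these lines; the data is made up to match it:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    source_df = spark.createDataFrame([("a", 35), ("b", 20)], ["id", "result"])
    target_df = spark.createDataFrame([("a", 35), ("b", 20)], ["id", "result"])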

Case 2: If you don't have access to the source and target DataFrames

  1. Add an alias to both the source and target DataFrames:

    joined_df = source_df.alias("src").join(
        target_df.alias("tgt"), source_df.id == target_df.id, "outer")

Note: In this case all the source columns will be on the left side and all the target columns on the right side, respectively.

  2. Use the DataFrame aliases to extract the column names:

    joined_source_columns = joined_df.select("src.*").columns
    joined_target_columns = joined_df.select("tgt.*").columns

  3. Use these lists with the add_alias() function to get the modified column names:

    joined_df_columns = add_alias(joined_source_columns, "source") + add_alias(joined_target_columns, "target")

    joined_df.toDF(*joined_df_columns).show()

Output:

+---------+-------------+---------+-------------+
|source_id|source_result|target_id|target_result|
+---------+-------------+---------+-------------+
|        a|           35|        a|           35|
|        b|           20|        b|           20|
+---------+-------------+---------+-------------+
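If what you actually need is the alias-qualified names from the question (source.A_column, target.A_column, ...) rather than renamed columns, the same select("src.*") trick can build them directly from joined_df; a sketch, assuming the "src"/"tgt" aliases above:

    qualified_columns = (
        [f"src.{c}" for c in joined_df.select("src.*").columns]
        + [f"tgt.{c}" for c in joined_df.select("tgt.*").columns]
    )
    print(qualified_columns)
    # e.g. ['src.id', 'src.result', 'tgt.id', 'tgt.result']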
arudsekaberne
  • Nice way, if you have access to the original source and target DataFrames before the join. But what if I get an already joined DataFrame? Or if I need to keep the column names for further processing? As I say in the question, is it possible to get the information from joined_df itself? – Lohi Mar 17 '23 at 11:46