
Let's say we join two DataFrames in PySpark; each one has its own alias and they have the same columns:

    joined_df = source_df.alias("source").join(target_df.alias("target"),
                    (col("source.A_column") == col("target.A_column")), 'outer')
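(For reference, the two DataFrames could have been built like this; the sample data is made up only so the example is self-contained:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    # hypothetical sample data, only to make the example reproducible
    source_df = spark.createDataFrame([(1, "x", "y")], ["A_column", "B_column", "C_column"])
    target_df = spark.createDataFrame([(1, "p", "q")], ["A_column", "B_column", "C_column"])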

How can I get a list of the column names of the joined_df DataFrame, with the aliases from the original DataFrames, something like:

[source.A_column, target.A_column, source.B_column, target.B_column, source.C_column, target.C_column]

This list shows up in the message of an AnalysisException, so the information is obviously stored somewhere, but I haven't found a way to show it without triggering the exception...

What I tried:

  1. Get the names as above with some direct method or property, but there is no such thing as a df.columns_with_alias property.
  2. Get the list of columns from the DataFrame as instances of the Column class (because this class keeps the alias info), but df.columns just gives you strings (see the sketch below)... and I found no other way.
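For example, as far as I can tell, both of these only return the plain names without any alias:

    print(joined_df.columns)
    # ['A_column', 'B_column', 'C_column', 'A_column', 'B_column', 'C_column']
    print([field.name for field in joined_df.schema.fields])
    # the same plain strings; the alias is not part of the schema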

Is there any way to show these column names?

Lohi
  • Does this answer your question? [How to resolve duplicate column names while joining two dataframes in PySpark?](https://stackoverflow.com/questions/55104036/how-to-resolve-duplicate-column-names-while-joining-two-dataframes-in-pyspark), it basically says there is no way except to rename all columns to have your prefix before joining. – Kashyap Mar 20 '23 at 22:27

2 Answers


I found only a really ugly way, which is usable for debugging purposes only, definitely not for production code:

    from pyspark.sql import DataFrame
    from pyspark.sql.utils import AnalysisException
    import re

    def show_column_names_with_alias(df: DataFrame):
        try:
            # select a column that cannot exist to force an AnalysisException
            df.select("this_column_name_is_not_in_the_dataframe")
        except AnalysisException as e:
            # the exception message lists all columns (with their aliases) in square brackets
            print(re.search(r"\[(.*)\]", e.desc).group(1))

Selecting a column that is not in the DataFrame raises an AnalysisException whose message contains the column names with their aliases; I just parse the message and print the matched part... It works, but there has to be a better way. The output is:

source.A_column, target.A_column, source.B_column, target.B_column, source.C_column, target.C_column
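For completeness, the call that produced that output would simply be (assuming the joined_df from the question):

    show_column_names_with_alias(joined_df)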
Lohi

Use a function to generate the new column names:

    def add_alias(columns, alias):
        return [f"{alias}_{column}" for column in columns]
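For example, with hypothetical column names id and result:

    add_alias(["id", "result"], "source")
    # ['source_id', 'source_result']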

Case 1: If you have access to the source and target DataFrames

  1. Use the toDF() function to rename all the columns:

    source_df = source_df.toDF(*add_alias(source_df.columns, "source"))
    target_df = target_df.toDF(*add_alias(target_df.columns, "target"))

  2. Join using the new column names:

    source_df.join(target_df, col("source_id") == col("target_id"), "outer").show()

Output:

+---------+-------------+---------+-------------+
|source_id|source_result|target_id|target_result|
+---------+-------------+---------+-------------+
|        a|           35|        a|           35|
|        b|           20|        b|           20|
+---------+-------------+---------+-------------+
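(For reference, the output above assumes sample DataFrames along these lines; the data is made up to match it:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    source_df = spark.createDataFrame([("a", 35), ("b", 20)], ["id", "result"])
    target_df = spark.createDataFrame([("a", 35), ("b", 20)], ["id", "result"])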

Case 2: If you don't have access to the source and target DataFrames

  1. Add an alias to both the source and target DataFrames:

    joined_df = source_df.alias("src").join(
        target_df.alias("tgt"), source_df.id == target_df.id, "outer")

Note: In this case all the source columns will be on the left side and all the target columns on the right side, respectively.

  2. Use the DataFrame aliases to extract the column names:

    joined_source_columns = joined_df.select("src.*").columns
    joined_target_columns = joined_df.select("tgt.*").columns

  3. Use these lists with the add_alias() function to get the modified column names:

    joined_df_columns = add_alias(joined_source_columns, "source") + add_alias(joined_target_columns, "target")

    joined_df.toDF(*joined_df_columns).show()

Output:

+---------+-------------+---------+-------------+
|source_id|source_result|target_id|target_result|
+---------+-------------+---------+-------------+
|        a|           35|        a|           35|
|        b|           20|        b|           20|
+---------+-------------+---------+-------------+
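If what you actually need is the alias-qualified names from the question (source.A_column, target.A_column, ...) rather than renamed columns, the same select("src.*") trick can build them directly from joined_df; a sketch, assuming the "src"/"tgt" aliases above:

    qualified_columns = (
        [f"src.{c}" for c in joined_df.select("src.*").columns]
        + [f"tgt.{c}" for c in joined_df.select("tgt.*").columns]
    )
    print(qualified_columns)
    # e.g. ['src.id', 'src.result', 'tgt.id', 'tgt.result']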
arudsekaberne
  • Nice way, if you have access to the original source and target DataFrames before the join. But what if I get an already joined DataFrame? Or if I need to keep the column names for further processing? As I say in the question, is it possible to get the information from joined_df itself? – Lohi Mar 17 '23 at 11:46