I have two tables as follows:
Table 1:
Table 2:
I want to replace the names of the Table 1 with the date column from table 2.
The final output should look like the below table:
All help is appreciated!
Thank You!
I assume that the table 2 is not so huge, considering that they are column name mappings, otherwise there will be memory issues in bringing them to driver. Try this.
tst1=sqlContext.createDataFrame([(1,2,3,4,5,6,7,8),(5,6,7,8,9,10,11,12),(13,14,15,16,17,18,19,20)],["a","b","c","d","e","f","g","h"])
tst2=sqlContext.createDataFrame([('a','apple'),('b','ball'),('c','cat'),('d','dog'),('e','elephant'),('f','fox'),('g','goat'),('h','hat')],["short","long"])
tst1.show()
+---+---+---+---+---+---+---+---+
| a| b| c| d| e| f| g| h|
+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 5| 6| 7| 8| 9| 10| 11| 12|
| 13| 14| 15| 16| 17| 18| 19| 20|
+---+---+---+---+---+---+---+---+
# Collect the table 2 to extract the mapping
tst_cl = tst2.collect()
# get the old and new names of the columns
old_name=[str(tst_cl[i][0]) for i in range(len(tst_cl))]
new_name=[str(tst_cl[i][1]) for i in range(len(tst_cl))]
# Rename the columns
tst_rn = tst1.select(old_name).toDF(*new_name)
tst_rn.show()
+-----+----+---+---+--------+---+----+---+
|apple|ball|cat|dog|elephant|fox|goat|hat|
+-----+----+---+---+--------+---+----+---+
| 1| 2| 3| 4| 5| 6| 7| 8|
| 5| 6| 7| 8| 9| 10| 11| 12|
| 13| 14| 15| 16| 17| 18| 19| 20|
+-----+----+---+---+--------+---+----+---+
Once you have collected the column mappings you can use any of the renaming techniques used here : PySpark - rename more than one column using withColumnRenamed
Hint : Should you face some order mismatch issues during collect(mostly you won't,but just if you want to be triple sure ) then consider combining the mapping in table 2 using F.array() method and then collect. The mapping has to be slightly changed
tst_array= tst2.withColumn("name_array",F.array(F.col('short'),F.col('long')))
tst_clc = tst_array.collect()
old_name = [str(tst_clc[i][2][0]) for i in range(len(tst_clc))]
new_name = [str(tst_clc[i][2][1]) for i in range(len(tst_clc))]