1

I have two tables as follows:

Table 1:

enter image description here

Table 2:

enter image description here

I want to replace the names of the Table 1 with the date column from table 2.

The final output should look like the below table:

enter image description here

All help is appreciated!

Thank You!

T.S
  • 51
  • 1
  • 10

1 Answers1

1

I assume that the table 2 is not so huge, considering that they are column name mappings, otherwise there will be memory issues in bringing them to driver. Try this.

tst1=sqlContext.createDataFrame([(1,2,3,4,5,6,7,8),(5,6,7,8,9,10,11,12),(13,14,15,16,17,18,19,20)],["a","b","c","d","e","f","g","h"])
tst2=sqlContext.createDataFrame([('a','apple'),('b','ball'),('c','cat'),('d','dog'),('e','elephant'),('f','fox'),('g','goat'),('h','hat')],["short","long"])
tst1.show()
+---+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|  h|
+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|
|  5|  6|  7|  8|  9| 10| 11| 12|
| 13| 14| 15| 16| 17| 18| 19| 20|
+---+---+---+---+---+---+---+---+
# Collect the table 2 to extract the mapping
tst_cl = tst2.collect()
# get the old and new names of the columns
old_name=[str(tst_cl[i][0]) for i in range(len(tst_cl))]
new_name=[str(tst_cl[i][1]) for i in range(len(tst_cl))]
# Rename the columns
tst_rn = tst1.select(old_name).toDF(*new_name)
tst_rn.show()
+-----+----+---+---+--------+---+----+---+
|apple|ball|cat|dog|elephant|fox|goat|hat|
+-----+----+---+---+--------+---+----+---+
|    1|   2|  3|  4|       5|  6|   7|  8|
|    5|   6|  7|  8|       9| 10|  11| 12|
|   13|  14| 15| 16|      17| 18|  19| 20|
+-----+----+---+---+--------+---+----+---+

Once you have collected the column mappings you can use any of the renaming techniques used here : PySpark - rename more than one column using withColumnRenamed

Hint : Should you face some order mismatch issues during collect(mostly you won't,but just if you want to be triple sure ) then consider combining the mapping in table 2 using F.array() method and then collect. The mapping has to be slightly changed

tst_array= tst2.withColumn("name_array",F.array(F.col('short'),F.col('long')))
tst_clc = tst_array.collect()
old_name = [str(tst_clc[i][2][0]) for i in range(len(tst_clc))]
new_name = [str(tst_clc[i][2][1]) for i in range(len(tst_clc))]
Raghu
  • 1,644
  • 7
  • 19