
I have a DataFrame as below:

----------------------
|  A  |   B   |  C   |
----------------------
|  1  | col_1 | val1 |
|  1  | col_2 | val2 |
|  1  | col_3 | val3 |
|  1  | col_4 | val4 |
----------------------

I need to convert it to the following:

---------------------------------------
|  A  | col_1 | col_2 | col_3 | col_4 |
---------------------------------------
|  1  | val1  | val2  | val3  | val4  |
---------------------------------------

How can I do this using Scala + Spark SQL? Any clue is highly appreciated.


1 Answer


A groupBy with a pivot and an agg clause should be enough:

import org.apache.spark.sql.functions.first

df.groupBy("A").pivot("B").agg(first("C"))

With the groupBy you first group the rows by their key, then pivot on the column whose values become the new column names (B in your case), and finally aggregate to pick the value (C).
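
For completeness, here is a minimal, self-contained sketch of the same pivot (the SparkSession setup with master("local[*]") is an illustrative assumption, not part of the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Rebuild the input DataFrame from the question
val df = Seq(
  (1, "col_1", "val1"),
  (1, "col_2", "val2"),
  (1, "col_3", "val3"),
  (1, "col_4", "val4")
).toDF("A", "B", "C")

// Group by A, turn each distinct value of B into its own column,
// and fill it with the matching C. first() is safe here because
// every (A, B) pair carries exactly one value.
val pivoted = df.groupBy("A").pivot("B").agg(first("C"))
pivoted.show()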

INPUT

df.show
+---+-----+----+
|  A|    B|   C|
+---+-----+----+
|  1|col_1|val1|
|  1|col_2|val2|
|  1|col_3|val3|
|  1|col_4|val4|
+---+-----+----+

OUTPUT

+---+-----+-----+-----+-----+
|  A|col_1|col_2|col_3|col_4|
+---+-----+-----+-----+-----+
|  1| val1| val2| val3| val4|
+---+-----+-----+-----+-----+
  • Thank you, let me check it. Why are you using first here in .agg(first("C"))? – BdEngineer Dec 18 '18 at 09:55
  • Because you need an aggregation function (sum, max, min...). In your case first will do the trick: since you have a single value for each group, it will take the first (unique) element. – SCouto Dec 18 '18 at 10:01
  • It works, sir, thank you so much, and sorry for the delay. Sir, I have two small issues: the pivot columns are in upper case and of integer type; I need to convert them dynamically to lower case and to double with 6-digit precision. How can this be done? Also, if the C column value is null or 0, I need to take the column D value instead. How can that be achieved? Assume there is another column D in the original dataframe. – BdEngineer Dec 19 '18 at 11:43
  • Sir, I found the lower() function for the case conversion, but how to do the other two? i.e. changing the pivot columns' datatype from integer to double, and taking the column D value when the C column value is null or 0? – BdEngineer Dec 19 '18 at 11:54
  • The first one is just a map over the columns. For the second you can use the when operator from the org.apache.spark.sql.functions package (see the sketch after these comments). I think you should open a new question with a proper example for that last one. – SCouto Dec 19 '18 at 13:23
  • Sir, I tried it and I am getting another error; can you please help me? https://stackoverflow.com/questions/53854262/expression-is-neither-present-in-the-group-by-nor-is-it-an-aggregate-function – BdEngineer Dec 19 '18 at 15:23
  • Sir, one more thing: the number of values in the pivot column (i.e. column B in INPUT) varies, sometimes 10, sometimes more. I need to save the OUTPUT to a Cassandra table, so the Cassandra table columns (i.e. the pivot columns) change dynamically. How to handle this kind of scenario? – BdEngineer Dec 20 '18 at 12:04
  • Hi. I know little about Cassandra. Please ask a new question so a Cassandra developer can help you. Also, it is a good idea on this site to keep the comments related to the question. – SCouto Dec 20 '18 at 16:53
  • Sure, sir, thank you... are you on IM? Sir, do you have an idea about partitioning in Spark? https://stackoverflow.com/questions/53442130/why-only-one-core-is-taking-all-the-load-how-to-make-other-29-cores-to-take-lo?noredirect=1#comment93902947_53442130 – BdEngineer Dec 21 '18 at 08:00
  • Sir, in the agg function of Spark I need to implement multiple conditions like the below; how to do it? ````If (Data_item_code) is CREDITSCORE PG_SUPPORT_STANDALONE_SCORE ====> pick data_item_value_string as the value, else ==> pick data_item_value_numeric as the value```` – BdEngineer Feb 07 '19 at 12:53