
I have an XML file that contains all the transformations I need to run over a DataFrame using the withColumn function, as shown below. How can I apply them to the DataFrame?

I had written code using the Scala ToolBox and runtimeMirror, which internally compiles the code and runs these rules over the DataFrame. It was working perfectly for fewer than 100 columns. But now the requirement has changed and the number of columns has increased from 80 to 210, so this code is failing with a StackOverflowError, which is an open issue for Scala 2.11 (https://github.com/scala/bug/issues/10026).

So I want to use a Spark utility instead of the Scala ToolBox. I have also tried foldLeft, but it gives an error as well, since I cannot pass a column function (like lit or concat) as a Column type.

XML rule file:

    <?xml version="1.0" encoding="utf-8" ?>
    <root>
      <columns>
        <column name="col1">
          <![CDATA[ data("columnA") ]]>
        </column>
        <column name="col2">
          <![CDATA[ lit("ABC") ]]>
        </column>
        <column name="col3">
          <![CDATA[ concat(col("columnC"), col("columnD")) ]]>
        </column>
        <column name="col4">
          <![CDATA[ regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", ""), "\\(", "-"), "^(-)$", "0").cast("double") ]]>
        </column>
        <column name="col5">
          <![CDATA[ lit("") ]]>
        </column>
        ...
      </columns>
    </root>
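For reference, a minimal sketch of reading these rules into (name, expression) pairs, assuming the scala-xml module is on the classpath ("rules.xml" is a placeholder path):

    import scala.xml.XML

    // Load the rule file and turn each <column> element into a
    // (columnName, scalaExpressionString) pair. The CDATA body is
    // returned by .text; trim strips the surrounding whitespace.
    val root = XML.loadFile("rules.xml")
    val rules: Seq[(String, String)] = (root \\ "column").map { node =>
      ((node \ "@name").text, node.text.trim)
    }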

Operations that I need to apply:

df.withColumn("col1", data("columnA")).withColumn("col2", lit("ABC")).withColumn("col3", concat(col("columnC"), col("columnD"))).withColumn("col4", regexp_replace(regexp_replace(regexp_replace(col("ColumnE"), "\\,|\\)", ""), "\\(", "-"), "^(-)$", "0").cast("double")).withColumn("col5", lit("")) ...
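For context, if each expression were already available as a Column value, the chain above could be folded instead of hard-coded. The sketch below uses a hypothetical applyColumns helper; producing the Column values from the XML strings is exactly the part that requires compiling code:

    import org.apache.spark.sql.{Column, DataFrame}

    // A fold over (name, Column) pairs. The obstacle is that the XML
    // yields Strings, not Columns, so this only works once the
    // expressions have been turned into real Column values.
    def applyColumns(df: DataFrame, ruleColumns: Seq[(String, Column)]): DataFrame =
      ruleColumns.foldLeft(df) { case (acc, (name, column)) =>
        acc.withColumn(name, column)
      }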

Versions that I am using:

Scala 2.11.12

Spark 2.4.3

Nikhil Suthar
  • Can you provide the code and the error you got with foldLeft? I think that should be the correct way to do it. Something like ```scala colsFromXML.foldLeft(df) { (df, newColMap) => df.withColumn(newColMap("name"), newColMap("colExpr")) } ``` Probably you'll need to preprocess the XML a little, but I don't see where you need to have the data type of the resulting column – Paul Sep 01 '19 at 11:13
  • Thanks for replying. I have tried to read the XML file and map it to a List as ("col1", "concat(col2,col3)") and use it like list.foldLeft(DF, listmap){ DF.withColumn(listmap(0), listmap(1)) }. I know that I am reading methods like concat and lit as Strings from the XML, so it gives an error: found String, required Column (which is the type of lit and the other column functions). Can you help me read the XML file formatted as above and use it as I mentioned? If that is not possible, could you please provide more details, like what colsFromXML is and how it actually works. – Nikhil Suthar Sep 01 '19 at 12:45
  • OK, I see the problem: you basically need to interpret the string you get from the XML as code. This can be done as in this answer: https://stackoverflow.com/questions/39142979/dynamic-code-evaluation-in-scala. If evaluating strings as code is too hacky for you, you might need to write some sort of custom parser that recognizes spark.sql.functions functions and returns the correct column expressions. This is obviously a lot of work. – Paul Sep 01 '19 at 13:30
  • Thanks Paul, I have checked stackoverflow.com/questions/39142979/… and found that my existing system uses the same logic (the ToolBox) as mentioned in that answer. It works, but as I mentioned in the description, that method gives a StackOverflowError when we work with more than 100 columns (https://github.com/scala/bug/issues/10026). That's why I was looking for a Spark solution instead of a Scala one. No doubt I have the option to use Spark SQL, but it would be complex. So finally I have decided to go with the logic I have given in the answer, and it is working perfectly. – Nikhil Suthar Sep 02 '19 at 09:49

1 Answer


I tried to find a solution in Spark itself but did not get anything other than Spark SQL. My rules are very complex, and expressing them in Spark SQL would make them even more complex. So I stuck with the same ToolBox logic (as the existing system works) and resolved the 100-column issue as below:

  1. First I read all the rules from the XML and generate the CodeRule string by concatenating .withColumn calls (same as the existing code).
  2. Then I count the occurrences of the string ".withColumn"; if the count is greater than 100 (say 133 columns), I simply divide CodeRule into two parts (FirstRule covering columns 1 to 99 and LastRule covering columns 100 to 133) and apply these rules in two steps, as below.
  3. First I apply FirstRule to the input DataFrame and get an intermediate DataFrame.
  4. Then I pass the resulting DataFrame through LastRule and get the final DataFrame.
  5. It is working perfectly for my case; a sketch of this batched approach is shown below.
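A minimal sketch of that batched evaluation, assuming the (name, expression) pairs parsed from the XML above. The names applyRules and applyAll and the batch size of 99 are illustrative, and any rule that references an outside value (like data("columnA") above) would also need that value in scope of the generated code:

    import scala.reflect.runtime.universe
    import scala.tools.reflect.ToolBox
    import org.apache.spark.sql.DataFrame

    // Compile a ".withColumn(...)" chain with the ToolBox and apply it.
    // The code is evaluated as a DataFrame => DataFrame function so the
    // input DataFrame can be passed in at runtime.
    def applyRules(df: DataFrame, chain: String): DataFrame = {
      val toolbox = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
      val code =
        s"""import org.apache.spark.sql.DataFrame
           |import org.apache.spark.sql.functions._
           |(df: DataFrame) => df$chain""".stripMargin
      toolbox.eval(toolbox.parse(code)).asInstanceOf[DataFrame => DataFrame](df)
    }

    // Keep each compiled chain under 100 withColumn calls to stay clear
    // of the Scala 2.11 StackOverflowError, feeding each batch the
    // DataFrame produced by the previous one.
    def applyAll(input: DataFrame, rules: Seq[(String, String)]): DataFrame =
      rules.grouped(99).foldLeft(input) { (df, batch) =>
        val chain = batch
          .map { case (name, expr) => s""".withColumn("$name", $expr)""" }
          .mkString
        applyRules(df, chain)
      }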
Nikhil Suthar