I'm not able to clean the cells whose values are in HTML format. I need to apply a clean function only to the columns whose names start with "Answer", because those columns hold HTML-formatted values. How can they be converted to plain text? I'm using a regex, but it doesn't help. The input format is shown in the comments; only some columns contain HTML values, and those are the ones I need to change.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._


object AppRunner extends App with Serializable
{

  System.setProperty("hadoop.home.dir", "C:\\hadoop-common-2.2.0-bin-master")

  val input = "input//ResultsFile1.csv"
  val output = "file:///G:/csvfile/output.csv"


  val spark = org.apache.spark.sql.SparkSession.builder
    .master("local")
    .appName("Spark CSV Reader")
    .getOrCreate()

  var df = spark.read.format("csv").option("header", "true").load(input)
  //df.show()
  var count = 0

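  def cleanHtml(df: DataFrame): DataFrame = {
    // Earlier attempt, kept for reference. withColumn returns a new DataFrame,
    // so the result of each call was discarded and df never changed:
    //    val interestColumns = df.columns
    //    for(targetCol <- interestColumns) {
    //      if (targetCol.contains("Answer"))
    //        df.withColumn(targetCol,clean(col(targetCol)))
    //    }
    df.columns.foldLeft(df) { (memoDf, colName) =>
      // Only columns whose names start with "Answer" hold HTML values.
      if (colName.startsWith("Answer")) {
        count = count + 1 // number of columns cleaned
        memoDf.withColumn(colName, clean(col(colName)))
      } else {
        memoDf
      }
    }
  }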


  def clean(col: Column): Column = {
    // Removes every tag except <a ...> and </a>. HTML entities such as &amp; are left as-is.
    val reg = """<(?!\/?a(?=>|\s.*>))\/?.*?>"""
    // Alternative pattern tried:
    // val reg ="""<(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""
    regexp_replace(col, reg, "")
  }

    //val test = df
    // println(cleanHtml(df))
  df = cleanHtml(df)
  println(count)
  df.show()
  df.write
    .option("header","true")
    .csv(output)
}
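
A minimal sketch of the tag-stripping step as a plain Scala function, kept separate from Spark so it can be tried on a single string. `HtmlCleanSketch`, `stripHtml`, and the small entity table are hypothetical names introduced for illustration; they are not part of the code above. It assumes the cells hold simple markup (for example `<p>`, `<ul>`, `<li>`) and that a bare `<` never appears in the text itself. A regex is not a real HTML parser, so this is only an approximation.

```scala
object HtmlCleanSketch {
  // Matches any tag-like run "<...>". The negated class [^>] also matches
  // line breaks, so tags that span several lines are covered.
  private val TagPattern = "<[^>]*>"

  // Hypothetical, deliberately small entity table; real HTML defines many more.
  private val Entities: Seq[(String, String)] = Seq(
    "&nbsp;" -> " ",
    "&lt;"   -> "<",
    "&gt;"   -> ">",
    "&quot;" -> "\"",
    "&amp;"  -> "&" // decoded last so "&amp;lt;" becomes "&lt;", not "<"
  )

  // Strips tags first, then decodes the listed entities, then trims.
  // Null input is passed through, since CSV cells may be empty.
  def stripHtml(html: String): String = {
    if (html == null) {
      null
    } else {
      val noTags = html.replaceAll(TagPattern, "")
      val decoded = Entities.foldLeft(noTags) {
        case (text, (entity, plain)) => text.replace(entity, plain)
      }
      decoded.trim
    }
  }
}
```

If this fits the data, one option would be to wrap it with `udf(HtmlCleanSketch.stripHtml _)` from `org.apache.spark.sql.functions` and call that in place of `clean` inside the `foldLeft`.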

Expected output for every cell with HTML values like the input (shown in the comments):

1) HuskySort -> need a really fast sort, and data tend to be sorted 

Average:

  • Husky n*ln(n) = 6908, but sorted n = 1000

  • Robin 1/2*nlogn = 1500

  • INFO 2n*ln(n) = 13816

  • Trump 2n*ln(n) = 13816 

2) RobinSort -> several key, need a stable sort

3) INFO Sort -> no addition memory, need in place sort

4) HuskySort -> worst time complexity is good

Worst:

  • Husky n*log(n) = 3000

  • Robin n^1.5 = 31622

  • INFO n^2 = 1000000

  • Trump n^2 = 1000000

5) RobinSort 9 x Average + 1 x Worst:

  • Husky: nlogn + 9nln(n) = 65170

  • Robin: 4.5n*log n + n^1.5 = 45123

  • "I'm using regex but it doesnt help" - can you please clarify what "doesn't help" means? Would be helpful if you edit the post to show sample input and expected vs. actual output. – Tzach Zohar Apr 09 '18 at 18:30
  • input:
    1) HuskySort -> need a really fast sort, and data tend to be sorted 

    Average:

    - Husky n*ln(n) = 6908, but sorted n = 1000

    - Robin 1/2*nlogn = 1500

    - INFO 2n*ln(n) = 13816

    – balaji mudaliyar Apr 09 '18 at 19:51
  • @TzachZohar I have added the input and output. Can you please help me with the solution? – balaji mudaliyar Apr 09 '18 at 20:03
  • First - to make this a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) (with emphasis on "minimal") - all the Apache Spark stuff is irrelevant. If all you're looking for is a function that turns HTML into its "text", next time include _only that_ in the question. And tag the post appropriately (`html`, `regex` and not `apache-spark`?). Second - here's a related post about the futility of using regex to parse HTML: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – Tzach Zohar Apr 09 '18 at 20:05
  • @TzachZohar Thanks for your reply. But I have a CSV file where certain columns have values in HTML format; that's why I'm using a DataFrame. Is it possible to change those HTML values to normal text? Thank you. – balaji mudaliyar Apr 09 '18 at 20:15
  • @LuigiPlinge Hi, can you please help me with this problem? Thanks a lot in advance. I need to show the CSV file using Play Framework, but some columns have values in HTML format, which are basically code. How should I clean the CSV (convert the HTML data to text for some columns)? – balaji mudaliyar Apr 26 '18 at 00:27

0 Answers