I'm not able to clean the cells whose values are in HTML format. I need to apply the clean function only to the columns whose names start with "Answer", because those columns hold HTML-formatted values. How can they be converted to plain text? I'm using a regex, but it doesn't help. I have shown the input format in the comment section.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
object AppRunner extends App with Serializable {
  System.setProperty("hadoop.home.dir", "C:\\hadoop-common-2.2.0-bin-master")
  val input = "input//ResultsFile1.csv"
  val output = "file:///G:/csvfile/output.csv"
  val spark = org.apache.spark.sql.SparkSession.builder
    .master("local")
    .appName("Spark CSV Reader")
    .getOrCreate()
  var df = spark.read.format("csv").option("header", "true").load(input)
  //df.show()
  var count = 0

  def cleanHtml(df: DataFrame): DataFrame = {
    // val interestColumns = df.columns
    // for(targetCol <- interestColumns) {
    //   if (targetCol.contains("Answer"))
    //     df.withColumn(targetCol,clean(col(targetCol)))
    // }
    val dfi = df.columns.foldLeft(df) { (memoDf, colName) =>
      if (colName.contains("Answer")) {
        count = count + 1
        memoDf.withColumn(colName, clean(col(colName)))
      } else {
        memoDf
      }
    }
    dfi
  }

  def clean(col: Column): Column = {
    // if(col.name contains "Answer")
    val reg = """<(?!\/?a(?=>|\s.*>))\/?.*?>"""
    // val reg ="""<(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""
    regexp_replace(col, reg, "")
  }

  //val test = df
  // println(cleanHtml(df))
  df = cleanHtml(df)
  println(count)
  df.show()
  df.write
    .option("header", "true")
    .csv(output)
}
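To rule Spark out, the same regex can be tried on a single string in plain Scala. This is only a sketch: the object name CleanHtmlSketch, the helper stripTags, and the sample cell are made up for illustration and are not my real data. The pattern is the one used in clean, which removes every tag except <a ...> and </a>.

```scala
object CleanHtmlSketch {
  // Same pattern as in clean(): matches any tag except <a ...> and </a>.
  val tagPattern = """<(?!\/?a(?=>|\s.*>))\/?.*?>"""

  // Hypothetical helper: strips the matched tags from one cell value.
  def stripTags(cell: String): String =
    if (cell == null) null else cell.replaceAll(tagPattern, "")

  def main(args: Array[String]): Unit = {
    // Made-up sample cell, not taken from the real CSV.
    val sample = "<p><b>Husky</b> n*ln(n) = 6908, but sorted n = 1000</p>"
    println(stripTags(sample)) // prints: Husky n*ln(n) = 6908, but sorted n = 1000
  }
}
```

If this works on a single string but the Spark job still leaves tags in place, the problem is more likely in how the CSV cells are read than in the regex itself.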
Expected output for every cell with HTML values, like the input (in the comment section):
1) HuskySort -> need a really fast sort, and data tend to be sorted
Average:
Husky n*ln(n) = 6908, but sorted n = 1000
Robin 1/2*nlogn = 1500
INFO 2n*ln(n) = 13816
Trump 2n*ln(n) = 13816
2) RobinSort -> several key, need a stable sort
3) INFO Sort -> no addition memory, need in place sort
4) HuskySort -> worst time complexity is good
Worst:
Husky n*log(n) = 3000
Robin n^1.5 = 31622
INFO n^2 = 1000000
Trump n^2 = 1000000
5) RobinSort 9 x Average + 1 x Worst:
Husky: nlogn + 9nln(n) = 65170
Robin: 4.5n*log n + n^1.5 = 45123
1) HuskySort -> need a really fast sort, and data tend to be sorted
Average:
- Husky n*ln(n) = 6908, but sorted n = 1000
- Robin 1/2*nlogn = 1500
- INFO 2n*ln(n) = 13816
– balaji mudaliyar Apr 09 '18 at 19:51