
Here is how I use Spark SQL in a small application I am working on. I have two HBase tables, say t1 and t2.

  • My input is a CSV file. I parse each and every line and query (via Spark SQL) the table t1, then write the output to another file.

  • Now I parse the second file, query the second table, apply certain functions over the result, and output the data. Table t1 has the purchase details and t2 has the list of items each user added to the cart, along with the time frame.

Input -> CustomerID (a list of them in a CSV file)
Output -> A CSV file in the particular format mentioned below.

CustomerID, details of the item he bought, first item he added to cart, all the items he added to cart until purchase.

For an input of 1100 records, it takes two hours to complete the whole process!

I was wondering if I could speed up the process, but I am stuck. Any help?

  • Do you really need the intermediate file? You definitely do not need to parse the file if you have the RDD that generated it – OneCricketeer Oct 30 '16 at 06:15
  • yes I agree with @cricket_007, that file can be another dataframe. Isn't it? – Ram Ghadiyaram Oct 30 '16 at 06:16
  • I don't think that what's taking time is the parse but launching 1100 queries over the tables. This question is quite broad to answer. Would you care elaborating on what kind of queries do you have in your files so we can actually help ! – eliasah Oct 30 '16 at 07:01
  • @cricket_007 it feels like an iteration over filters kind of queries :-/ – eliasah Oct 30 '16 at 07:02
  • @eliasah select col1,col2,col3 from purchase where CstmrID ='' and select col1,col2,col3,col4,date from cartdetails where ='' and date<'' – theScalaGuy Oct 30 '16 at 09:16
  • So you are actually iterating. I totally understand the obfuscation but if you can't be more specific or at least data samples, I'm afraid we can't help. – eliasah Oct 30 '16 at 10:43

1 Answer


How about this DataFrame approach...

1) Create a DataFrame from the CSV.

See how-to-read-csv-file-as-dataframe, or something like the example below.

import java.io.StringReader
import org.apache.spark.sql.Row

// Parse each line of the CSV into a Row (CSVReader comes from the opencsv library;
// mapSchema maps the parsed fields onto the target schema's column types)
val csv = sqlContext.sparkContext.textFile(csvPath).map { txt =>
  try {
    val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
    val parsedRow = reader.readNext()
    Row(mapSchema(parsedRow, schema): _*)
  } catch {
    case e: IllegalArgumentException =>
      throw new UnsupportedOperationException("converted from Arg to Op except", e)
  }
}
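
The map above yields an RDD[Row]; to get an actual DataFrame you still have to apply the schema. A minimal sketch, assuming schema is the same StructType used by mapSchema above (the name csvDF is just for illustration):

val csvDF = sqlContext.createDataFrame(csv, schema)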

2) Create another DataFrame from the HBase data (if you are using Hortonworks, the SHC connector) or Phoenix.
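
For instance, with the phoenix-spark plugin the Phoenix-backed table can be loaded straight into a DataFrame. A minimal sketch, assuming a Phoenix table named PURCHASE and a placeholder ZooKeeper URL:

val t1DF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "PURCHASE")      // hypothetical Phoenix table name
  .option("zkUrl", "zkhost:2181")   // hypothetical ZooKeeper quorum
  .load()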

3) Do the join and apply your functions (maybe a UDF, or when/otherwise, etc.); the result can be a DataFrame again. For example:
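
A rough sketch of that step, assuming csvDF and t1DF from above share a CustomerID column; the other column names and the UDF are made up for illustration:

import org.apache.spark.sql.functions._

// hypothetical UDF applied after the join
val normalizeItem = udf((s: String) => if (s == null) null else s.trim.toUpperCase)

val joined = csvDF.join(t1DF, Seq("CustomerID"), "left_outer")
  .withColumn("item", normalizeItem(col("item")))
  .withColumn("bigPurchase", when(col("cost") > 100, lit("yes")).otherwise(lit("no")))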

4) Join the result DataFrame with the second table and output the data as CSV, as in the pseudo-code example below...

It should be possible to prepare a DataFrame with custom columns and corresponding values and save it as a CSV file. You can try this kind of thing in the spark-shell as well.

// Read a CSV into a DataFrame using the spark-csv package
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("cars93.csv")

// Filter, derive a new column, and write the result back out as CSV
val df2 = df.filter("quantity <= 4.0")
val col = df2.col("cost") * 0.453592
val df3 = df2.withColumn("finalcost", col)

df3.write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("output-csv")

Hope this helps.. Good luck.

  • The Hbase connector probably isn't necessary. Phoenix is allowing SQL against Hbase from SparkSQL – OneCricketeer Oct 30 '16 at 06:26
  • @RamPrasadG, I thought of using the DataFrame, but I don't think it would come in handy in my case. I want the output file (CSV) in a particular format, which I have mentioned in my edited question above. – theScalaGuy Oct 30 '16 at 09:11
  • " I want the output file(CSV) in a particular format " That doesn't matter for that particular format prepare schema.. please see my updated answer. – Ram Ghadiyaram Nov 02 '16 at 11:06