
I intend to apply linear regression to a dataset. It works fine when I run it on a subset of the data in *.txt format, as below:

// How could I read 26 *.tar.gz compressed files into a DataFrame?
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import sqlContext.implicits._

val inputPath = "/Users/jasonzhu/Downloads/a.txt"

val rawDF = sc.textFile(inputPath).toDF()
val df = se.kth.spark.lab1.task2.Main.body(sqlContext, rawDF)

// Split into training (95%) and test (5%) sets; cache the training set.
val splitDf = df.randomSplit(Array(0.95, 0.05), seed = 42L)
val (obsDF, testDF) = (splitDf(0).cache(), splitDf(1))

val maxIter = 6
val regParam = 0.07
val elasticNetParam = 0.1
println(s"maxIter=$maxIter, regParam=$regParam, elasticNetParam=$elasticNetParam")

val myLR = new LinearRegression()
  .setMaxIter(maxIter)
  .setRegParam(regParam)
  .setElasticNetParam(elasticNetParam)
val lrStage = 0
val pipeline = new Pipeline().setStages(Array(myLR))
val pipelineModel: PipelineModel = pipeline.fit(obsDF)
val lrModel = pipelineModel.stages(lrStage).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary

// Print the RMSE and r2 of our model
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")

// Do prediction - print the first 5 rows
val predictedDF = pipelineModel.transform(testDF)
predictedDF.show(5, false)

After this spike, I intend to fit the linear regression model on the whole dataset, which resides in 26 *.tar.gz files. I'd like to know how I should read these compressed files into a Spark DataFrame and consume them efficiently, taking advantage of Spark's parallelism. Thanks!

KAs

2 Answers


The `textFile()` method accepts wildcards as well. From the documentation:

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
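For the case here, a minimal sketch (assuming the 26 files sit together in one directory; the path is hypothetical):

val rawRDD = sc.textFile("/Users/jasonzhu/Downloads/data/*.gz") // one glob matches all 26 files
val rawDF = rawRDD.toDF()

Spark decompresses plain .gz files transparently and reads the matched files in parallel, but gzip is not splittable, so each archive becomes a single partition.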

ShirishT
  • 232
  • 1
  • 4
  • but `textFile()` raises an error when trying to read a `*.tar.gz` file – KAs Nov 15 '16 at 23:45
  • For tarballs, a relevant story is [here](http://stackoverflow.com/questions/38635905/reading-in-multiple-files-compressed-in-tar-gz-archive-into-spark) – ShirishT Nov 15 '16 at 23:56 (a sketch of that approach follows)
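A hedged sketch of the approach from that linked question (untested against this dataset): read each archive whole with `binaryFiles()`, then untar it on the executors using Apache Commons Compress, which is an assumed extra dependency here.

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import scala.io.Source

// binaryFiles() yields (path, stream) pairs, one per whole archive.
val lines = sc.binaryFiles("/Users/jasonzhu/Downloads/*.tar.gz").flatMap { case (_, pds) =>
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))
  // Walk the tar entries; reads on `tar` stop at each entry's boundary.
  Iterator.continually(tar.getNextTarEntry)
    .takeWhile(_ != null)
    .filter(_.isFile)
    .flatMap(_ => Source.fromInputStream(tar).getLines())
    .toList // materialize before the stream is discarded
}
val rawDF = lines.toDF()

Because each .tar.gz is read as one blob, parallelism is capped at one task per archive (26 here), and each archive must pass through a single executor.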

Start with an empty RDD, loop over the files, read each one as an RDD, and merge it into the accumulated RDD with a union operation in each iteration, as sketched below.
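A minimal sketch of this loop-and-union approach, assuming the archives can be read directly by `textFile()` (i.e. plain .gz rather than .tar.gz) and using hypothetical paths:

// Start from an empty RDD and fold each file in with union().
var combined = sc.emptyRDD[String]
for (i <- 1 to 26) {
  combined = combined.union(sc.textFile(s"/data/part-$i.gz"))
}
// Flatter alternative: sc.union((1 to 26).map(i => sc.textFile(s"/data/part-$i.gz")))

Note that `sc.union` over the whole list builds a shallower lineage than 26 pairwise unions.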

Lokesh Yadav