
I am getting a fixed-width .txt source file from which I need to extract 20K columns. Since I could not find a library for processing fixed-width files in Spark, I developed code that extracts the fields from the fixed-width text file.

The code reads the text file as an RDD with

sparkContext.textFile("abc.txt") 

and then reads a JSON schema to get the name and width of each column.
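The schema file format is not shown here, so purely as an illustrative sketch, assume it is a JSON array of {"name", "length"} objects; json4s (which ships with Spark) can parse it on the driver into the name/width sequences used later. schema.json, ColDef and the field names are made-up placeholders.

import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical schema file: [{"name":"col_1","length":10},{"name":"col_2","length":5}, ...]
case class ColDef(name: String, length: Int)

implicit val formats: Formats = DefaultFormats

val schemaJson   = Source.fromFile("schema.json").mkString
val colDefs      = parse(schemaJson).extract[List[ColDef]]

val columnSeq    = colDefs.map(_.name)    // column names, in file order
val colLengthSeq = colDefs.map(_.length)  // fixed widths, in file order
val colCount     = colDefs.length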

  • In a function, I take the fixed-length string and, using the start and end position of each column, call substring to build an Array of field values.

  • Map this function over the RDD.

  • Convert the resulting RDD to a DataFrame, map the column names onto it, and write it to Parquet.

The representative code:

import spark.implicits._

val rdd1 = spark.sparkContext.textFile("file1")

// Cuts one fixed-width line into its fields, given the column lengths.
def substrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// colLengthSeq holds the column lengths read from the separate schema file
val stringArray = rdd1.map(substrString(_, colLengthSeq))

stringArray.toDF("StringCol")
  .select((0 until colCount).map(j => $"StringCol"(j).as(columnSeq(j))): _*)
  .write.mode("overwrite").parquet("c:\\home\\")

This code works fine for files with a smaller number of columns, but it takes a lot of time and resources with 20K columns. As the number of columns increases, the run time increases as well.

Has anyone faced such an issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?

  • It's difficult to say without your code examples. It could be related to this problem: https://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns. Could you post your code and the Spark `.explain()` plan? – Avseiytsev Dmitriy Sep 12 '18 at 12:14
  • Have a look at the representative code mentioned in the previous thread: https://stackoverflow.com/questions/51118204/spark-java-heap-space-issue-executorlostfailure-container-exited-with-stat – katty Sep 12 '18 at 12:54
  • Could you provide a syntactically correct and well-formatted piece of code and the Spark explain plan? – Avseiytsev Dmitriy Sep 12 '18 at 13:28
  • Hi Dmitriy, I could not provide the code and plan as it is organisation-confidential. – katty Sep 12 '18 at 16:03
  • Looks like you load the data _twice_ in memory. Did you consider using an efficient parser (like univocity) to read the raw file and build the RDD one record at a time? _(note that Spark has the option to use univocity to load CSV into DataFrames so you have something to start with in the Spark repo...)_ – Samson Scharfrichter Sep 12 '18 at 19:29
  • Thanks Samson. I am new to Spark. Regarding loading the RDD twice: does that mean loading the data into the RDD first and then reading it again when I apply the map function? Can I use cache here to avoid reading it twice? Can you please guide me in detail? – katty Sep 13 '18 at 06:56

0 Answers