
I am importing a fixed-width file from a source system. It is a very wide dataset: each record is a fixed-width string close to 120,000 characters long, and I need to extract about 20K columns from it and write them to Parquet.

Can anyone suggest some performance optimization techniques that would reduce the file reading time? I am currently reading the source file as an RDD, but it takes a lot of time.

Can anyone suggest alternative approaches that would reduce the time, for example a Java IO stream?
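
For context, a minimal sketch of the kind of RDD-based fixed-width parsing described above (the actual code is in the question linked in the comments below); the input path, output path, and column widths here are hypothetical placeholders, not the real ~20K-column schema:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthRddRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-rdd").getOrCreate()

    // Hypothetical column widths; the real schema would carry ~20K of them.
    val widths = Seq(10, 5, 20)
    val starts = widths.scanLeft(0)(_ + _)   // start offset of each column
    val slices = starts.zip(widths)          // (start, length) pairs

    val schema = StructType(slices.indices.map(i => StructField(s"col_$i", StringType)))

    // Each record is one very long line; substring out every column, trimming padding.
    val rows = spark.sparkContext
      .textFile("/path/to/fixed_width.txt")  // hypothetical input path
      .map { line =>
        Row.fromSeq(slices.map { case (start, len) =>
          line.substring(start, start + len).trim
        })
      }

    spark.createDataFrame(rows, schema)
      .write.mode("overwrite")
      .parquet("/path/to/output_parquet")    // hypothetical output path

    spark.stop()
  }
}
```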

katty
  • Why do you use an RDD instead of a DataFrame? – Emiliano Martinez Sep 18 '18 at 09:05
  • The source file is a fixed-width file, and I need to map a function over the RDD that splits the columns based on the widths in the schema. Currently there is no native support for fixed-width files in Spark. – katty Sep 18 '18 at 09:26
  • Post your code so we can see if it can be optimized. – Emiliano Martinez Sep 18 '18 at 09:31
  • Link to code: https://stackoverflow.com/questions/51118204/spark-java-heap-space-issue-executorlostfailure-container-exited-with-stat – katty Sep 18 '18 at 09:37
  • How often do you open that file? Would converting it to Parquet or ORC, which are optimised for projected scans, and then using those for repetitive reads be acceptable? – alexeipab Sep 19 '18 at 11:04
  • I haven't tried converting the .txt to Parquet and then reading that. I have tried other options like repartitioning and caching, but with no significant improvement. – katty Sep 19 '18 at 20:16
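
Building on the suggestions in the comments above (use a DataFrame instead of an RDD, and convert once to Parquet for repeated reads), a minimal sketch; the (name, start, length) triples and paths are hypothetical placeholders, not the asker's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

object FixedWidthDataFrameRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-df").getOrCreate()

    // Hypothetical (name, start, length) triples; the real layout would carry ~20K of them.
    val layout = Seq(("id", 0, 10), ("name", 10, 5), ("amount", 15, 20))

    // Read each record as a single 'value' column and carve out only the needed
    // columns with substr, so Spark SQL does the slicing instead of an RDD map.
    val parsed = spark.read.text("/path/to/fixed_width.txt")   // hypothetical input path
      .select(layout.map { case (name, start, len) =>
        trim(col("value").substr(start + 1, len)).as(name)     // substr is 1-based
      }: _*)

    // One-time conversion to Parquet for later, repeated reads.
    parsed.write.mode("overwrite").parquet("/path/to/converted_parquet")  // hypothetical path

    // Subsequent jobs read only the columns they need; Parquet prunes the rest.
    val subset = spark.read.parquet("/path/to/converted_parquet").select("id", "amount")
    subset.show(5)

    spark.stop()
  }
}
```

The main difference from the RDD version is that the column slicing runs as Spark SQL expressions, and once the data has been written to Parquet, later jobs only deserialize the columns they select instead of re-parsing 120,000-character lines on every read.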

0 Answers