
I am importing a fixed-width file from a source system. It is a very wide dataset: each record is a fixed-width string close to 120,000 characters long, and I need to extract about 20K columns from it and write them to Parquet.

Can anyone suggest some performance optimization techniques that would reduce the file reading time? I am currently reading the source file as an RDD, but it takes a lot of time.

Can anyone suggest alternative approaches that would reduce the time, for example a Java IO stream?
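
For context, a minimal sketch of the kind of RDD-based fixed-width parsing described above (the actual code is in the question linked in the comments below); the input path, output path, and column widths here are hypothetical placeholders, not the real ~20K-column schema:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthRddRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-rdd").getOrCreate()

    // Hypothetical column widths; the real schema would carry ~20K of them.
    val widths = Seq(10, 5, 20)
    val starts = widths.scanLeft(0)(_ + _)   // start offset of each column
    val slices = starts.zip(widths)          // (start, length) pairs

    val schema = StructType(slices.indices.map(i => StructField(s"col_$i", StringType)))

    // Each record is one very long line; substring out every column, trimming padding.
    val rows = spark.sparkContext
      .textFile("/path/to/fixed_width.txt")  // hypothetical input path
      .map { line =>
        Row.fromSeq(slices.map { case (start, len) =>
          line.substring(start, start + len).trim
        })
      }

    spark.createDataFrame(rows, schema)
      .write.mode("overwrite")
      .parquet("/path/to/output_parquet")    // hypothetical output path

    spark.stop()
  }
}
```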

katty
  • Why do you use an RDD instead of a DataFrame? – Emiliano Martinez Sep 18 '18 at 09:05
  • The source file is a fixed-width file, and I need to map a function over the RDD that splits the columns based on the widths in the schema. Currently there is no native support for fixed-width files in Spark. – katty Sep 18 '18 at 09:26
  • Post your code so we can see if it can be optimized. – Emiliano Martinez Sep 18 '18 at 09:31
  • Link to code: https://stackoverflow.com/questions/51118204/spark-java-heap-space-issue-executorlostfailure-container-exited-with-stat – katty Sep 18 '18 at 09:37
  • How often do you open that file? Would converting it to Parquet or ORC, which are optimised for projected scans, and then using those for repetitive reads be acceptable? – alexeipab Sep 19 '18 at 11:04
  • I haven't tried converting the .txt to Parquet and then reading that. I have tried other options like repartitioning and caching, but with no significant improvement. – katty Sep 19 '18 at 20:16
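
Building on the suggestions in the comments above (use a DataFrame instead of an RDD, and convert once to Parquet for repeated reads), a minimal sketch; the (name, start, length) triples and paths are hypothetical placeholders, not the asker's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

object FixedWidthDataFrameRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-df").getOrCreate()

    // Hypothetical (name, start, length) triples; the real layout would carry ~20K of them.
    val layout = Seq(("id", 0, 10), ("name", 10, 5), ("amount", 15, 20))

    // Read each record as a single 'value' column and carve out only the needed
    // columns with substr, so Spark SQL does the slicing instead of an RDD map.
    val parsed = spark.read.text("/path/to/fixed_width.txt")   // hypothetical input path
      .select(layout.map { case (name, start, len) =>
        trim(col("value").substr(start + 1, len)).as(name)     // substr is 1-based
      }: _*)

    // One-time conversion to Parquet for later, repeated reads.
    parsed.write.mode("overwrite").parquet("/path/to/converted_parquet")  // hypothetical path

    // Subsequent jobs read only the columns they need; Parquet prunes the rest.
    val subset = spark.read.parquet("/path/to/converted_parquet").select("id", "amount")
    subset.show(5)

    spark.stop()
  }
}
```

The main difference from the RDD version is that the column slicing runs as Spark SQL expressions, and once the data has been written to Parquet, later jobs only deserialize the columns they select instead of re-parsing 120,000-character lines on every read.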

0 Answers