
I need to ingest a very wide fixed-width dataset into Parquet columns. Currently I am reading the wide dataset as an RDD in Scala, splitting the columns with the substring function, and then writing to Parquet.

Currently there are close to 10 million fixed-width records, and it takes about 2 days to load the data.

Can anyone please tell me the most efficient way of reading this wide dataset in Scala or Java?
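
A minimal sketch of the pipeline described above — reading the file as an RDD, slicing each record with substring, and writing Parquet. The column names, offsets, and paths are hypothetical placeholders:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-to-parquet").getOrCreate()

    // Hypothetical (name, start, length) layout of the fixed-width columns.
    val layout = Seq(("col1", 0, 10), ("col2", 10, 5), ("col3", 15, 20))
    val schema = StructType(layout.map { case (name, _, _) => StructField(name, StringType) })

    // Read raw lines as an RDD and slice each record with substring,
    // which is the approach described in the question.
    val rows = spark.sparkContext
      .textFile("/path/to/fixed_width_input") // hypothetical input path
      .map { line =>
        Row.fromSeq(layout.map { case (_, start, len) =>
          line.substring(start, math.min(start + len, line.length)).trim
        })
      }

    // Build the DataFrame straight from RDD[Row] with an explicit schema.
    spark.createDataFrame(rows, schema)
      .write.mode("overwrite")
      .parquet("/path/to/parquet_output") // hypothetical output path

    spark.stop()
  }
}
```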

katty
  • Hi @Katty. Please don't ask the same question over and over again. Instead try to edit the original one, and provide additional details, including [mcve] / [reproducible example](https://stackoverflow.com/q/48427185) (this will make people more likely to help you). In general - very wide data is a poor fit for Spark but within some limits you can make it work depending on the operations you want to implement. – zero323 Sep 20 '18 at 10:46
  • I will delete the duplicate questions, but can you tell me how to deal with a very wide dataset? I am using an RDD to read it, but it takes a lot of time – katty Sep 20 '18 at 10:49
  • At first glance the code you provided in the other question would compile, but considering the overall shape, I doubt you can do much without rethinking the overall choice of tools and/or output shape. Creating Dataset[Row] directly could give you some small performance boost though. – zero323 Sep 20 '18 at 11:57
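
Not from the thread, but as an illustration of keeping the per-record splitting out of hand-written RDD code, one alternative is to stay in the DataFrame API: read the file with spark.read.text and derive each column with the built-in substring function. This is only a sketch, reusing the same hypothetical layout and paths as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, trim}

object FixedWidthIngestSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-to-parquet-sql").getOrCreate()

    // Hypothetical (name, start, length) layout; substring() positions are 1-based.
    val layout = Seq(("col1", 1, 10), ("col2", 11, 5), ("col3", 16, 20))

    // spark.read.text yields a single string column named "value".
    val raw = spark.read.text("/path/to/fixed_width_input") // hypothetical input path

    // Derive every output column with Spark SQL expressions instead of an RDD map.
    val parsed = raw.select(layout.map { case (name, start, len) =>
      trim(substring(col("value"), start, len)).as(name)
    }: _*)

    parsed.write.mode("overwrite").parquet("/path/to/parquet_output") // hypothetical output path
    spark.stop()
  }
}
```

Whether either variant actually speeds up a 10-million-record load depends on where the time is going (input splits, partition count, output committer), so it is worth profiling rather than assuming.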

0 Answers