
I am new to Hadoop and Spark. I am trying to process almost 100 GB of data on my local system with 4 cores and 32 GB of RAM. My code takes an hour and a half just to count the data. Am I doing something wrong? Please help.

My Code is below:

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadCSV {
  public static void main(String[] args) {
    long startTime = System.currentTimeMillis();
    SparkSession sparkSession = SparkSession.builder().appName("CsvReader")
      .master("local[4]")
      .config("spark.sql.debug.maxToStringFields", 100)
      .getOrCreate();
    SparkContext sparkContext = sparkSession.sparkContext();
    sparkContext.setLogLevel("ERROR");

    try {
      String filePath = "/mnt/vol2/OpenLR/openlr/processedData/Monday/*/*/*.csv";
      Dataset<Row> dataset = sparkSession.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(filePath);
      System.out.println("Total: " + dataset.count());
      System.out.println("Time taken to complete: " + (System.currentTimeMillis() - startTime));
    } catch (Exception e) {
      e.printStackTrace();
    }

    sparkContext.stop();
    sparkSession.close();
  }
}
Guoran Yun

2 Answers


The code looks straightforward, but since the data is on a mounted disk, I suspect most of the time is spent reading the 100 GB over the network.

Even if the data is on the same machine that does the processing, the disk's read speed and how much contention it can handle (Spark reads multiple files in parallel) will still limit the final throughput/time.

According to the Spark tuning guide:

Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast.

Islam Elbanna

You can try removing inferSchema if you are sure about the data quality of the files.

Check this: Performance Overhead.
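
For illustration, a minimal sketch of reading without inferSchema by supplying an explicit schema instead, reusing sparkSession and filePath from the question; the column names and types below are hypothetical and should be replaced with the actual columns of your CSV files:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: adjust the column names and types to match your CSV files.
StructType schema = new StructType()
    .add("id", DataTypes.LongType)
    .add("timestamp", DataTypes.TimestampType)
    .add("value", DataTypes.DoubleType);

Dataset<Row> dataset = sparkSession.read()
    .option("header", "true")
    .schema(schema)   // avoids an extra pass over 100 GB just to infer column types
    .csv(filePath);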

ZMI
  • I removed inferSchema but it reduced the time by only 100 seconds – Ajit Sharma Feb 24 '23 at 04:53
  • Can you try local[*]? Spark will decide the number of threads. You can also check the number of partitions of the dataset and possibly increase it (see the sketch below). https://stackoverflow.com/questions/61369523/reading-huge-csv-file-with-spark – ZMI Feb 25 '23 at 09:39
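
A small sketch of the two suggestions from the comment above (local[*] plus inspecting, and if needed increasing, the partition count), reusing the names from the question; the value 200 is only an illustrative example:

SparkSession sparkSession = SparkSession.builder().appName("CsvReader")
    .master("local[*]")   // let Spark pick the number of threads
    .getOrCreate();

Dataset<Row> dataset = sparkSession.read()
    .option("header", "true")
    .csv(filePath);

// Inspect how many partitions Spark created for the input files.
System.out.println("Partitions: " + dataset.rdd().getNumPartitions());

// Optionally increase parallelism for subsequent heavy transformations
// (200 is just an example value; a plain count() does not need this).
Dataset<Row> repartitioned = dataset.repartition(200);

System.out.println("Total: " + dataset.count());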