
I am new to Hadoop and Spark. I am trying to process almost 100 GB of data on my local system with 4 cores and 32 GB of RAM. My code takes an hour and a half just to count the data. Am I doing something wrong? Please help.

My Code is below:

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadCSV {
  public static void main(String[] args) {
    long startTime = System.currentTimeMillis();
    SparkSession sparkSession = SparkSession.builder().appName("CsvReader")
      .master("local[4]")
      .config("spark.sql.debug.maxToStringFields", 100)
      .getOrCreate();
    SparkContext sparkContext = sparkSession.sparkContext();
    sparkContext.setLogLevel("ERROR");

    try {
      String filePath = "/mnt/vol2/OpenLR/openlr/processedData/Monday/*/*/*.csv";
      Dataset<Row> dataset = sparkSession.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(filePath);
      System.out.println("Total: " + dataset.count());
      System.out.println("Time taken to complete: " + (System.currentTimeMillis() - startTime));
    } catch (Exception e) {
      e.printStackTrace();
    }

    sparkContext.stop();
    sparkSession.close();
  }
}
Guoran Yun

2 Answers


The code looks straightforward, but since the data is on a mounted disk, I suspect most of the time is spent reading the 100 GB over the network.

Even if the data is on the same machine that does the processing, the disk's read speed and how much contention it can handle (Spark reads multiple files in parallel) will still limit the final throughput/time.

According to the Spark tuning guide:

Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast.

Islam Elbanna

You can try removing inferSchema if you are sure about the data quality of the files.

Check this: Performance Overhead.
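
For illustration, a minimal sketch of reading without inferSchema by supplying an explicit schema instead, reusing sparkSession and filePath from the question; the column names and types below are hypothetical and should be replaced with the actual columns of your CSV files:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema: adjust the column names and types to match your CSV files.
StructType schema = new StructType()
    .add("id", DataTypes.LongType)
    .add("timestamp", DataTypes.TimestampType)
    .add("value", DataTypes.DoubleType);

Dataset<Row> dataset = sparkSession.read()
    .option("header", "true")
    .schema(schema)   // avoids an extra pass over 100 GB just to infer column types
    .csv(filePath);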

ZMI
  • I removed inferSchema but it reduced the time by only 100 seconds – Ajit Sharma Feb 24 '23 at 04:53
  • Can you try local[*]? Spark will decide the number of threads. You can also check the number of partitions of the dataset and possibly increase it (see the sketch below). https://stackoverflow.com/questions/61369523/reading-huge-csv-file-with-spark – ZMI Feb 25 '23 at 09:39
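
A small sketch of the two suggestions from the comment above (local[*] plus inspecting, and if needed increasing, the partition count), reusing the names from the question; the value 200 is only an illustrative example:

SparkSession sparkSession = SparkSession.builder().appName("CsvReader")
    .master("local[*]")   // let Spark pick the number of threads
    .getOrCreate();

Dataset<Row> dataset = sparkSession.read()
    .option("header", "true")
    .csv(filePath);

// Inspect how many partitions Spark created for the input files.
System.out.println("Partitions: " + dataset.rdd().getNumPartitions());

// Optionally increase parallelism for subsequent heavy transformations
// (200 is just an example value; a plain count() does not need this).
Dataset<Row> repartitioned = dataset.repartition(200);

System.out.println("Total: " + dataset.count());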