I am new to Hadoop and Spark. I am trying to process almost 100 GB of data on my local machine (4 cores, 32 GB RAM). Just counting the rows takes about an hour and a half. Am I doing something wrong? Please help.
My code is below:
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadCSV {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();

        SparkSession sparkSession = SparkSession.builder()
                .appName("CsvReader")
                .master("local[4]")
                .config("spark.sql.debug.maxToStringFields", 100)
                .getOrCreate();
        SparkContext sparkContext = sparkSession.sparkContext();
        sparkContext.setLogLevel("ERROR");

        try {
            String filePath = "/mnt/vol2/OpenLR/openlr/processedData/Monday/*/*/*.csv";
            // Read every CSV file matching the glob into a single Dataset
            Dataset<Row> dataset = sparkSession.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv(filePath);
            System.out.println("Total: " + dataset.count());
            System.out.println("Time taken to complete: " + (System.currentTimeMillis() - startTime) + " ms");
        } catch (Exception e) {
            e.printStackTrace();
        }

        sparkContext.stop();
        sparkSession.close();
    }
}
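One thing I have read is that inferSchema makes Spark scan the whole input once just to work out the column types, before the count even runs. A sketch of what I think the explicit-schema variant would look like instead (the column names and types below are placeholders, not my real schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ReadCSVWithSchema {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CsvReaderWithSchema")
                .master("local[4]")
                .getOrCreate();

        // Hypothetical schema: replace with the actual column names and types.
        StructType schema = new StructType()
                .add("id", DataTypes.LongType)
                .add("timestamp", DataTypes.StringType)
                .add("value", DataTypes.DoubleType);

        Dataset<Row> dataset = spark.read()
                .option("header", "true")
                .schema(schema) // explicit schema, so no extra inference pass over the input
                .csv("/mnt/vol2/OpenLR/openlr/processedData/Monday/*/*/*.csv");

        System.out.println("Total: " + dataset.count());
        spark.close();
    }
}

Is this the right direction, or is the bottleneck somewhere else?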