I have the following Python script:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Test') \
    .config("spark.driver.extraJavaOptions", "-Xss1G") \
    .master('local[*]') \
    .getOrCreate()

dataframe = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://localhost:27017/test") \
    .option("database", "finance") \
    .option("collection", "finished13") \
    .option("pipeline", "[ { '$sort': {'prop1':-1} }, {'$limit': 700}]") \
    .load()

dataframe = dataframe.limit(2)
dataframe.show()
Spark initially samples the MongoDB collection to infer the schema, so the load() call takes a while. The dataframe.limit(2) call returns immediately (as expected), since Spark evaluates lazily. On the other hand, when I write the very same code in Java, as seen below:
SparkSession session = SparkSession
        .builder()
        .master("local[*]")
        .appName("Java Spark SQL basic example")
        .config("spark.driver.extraJavaOptions", "-Xss1G")
        .getOrCreate();

Dataset<Row> dataset = session.read()
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", "mongodb://localhost:27017/test")
        .option("collection", "finished13")
        .option("pipeline", "[ { '$sort': {'prop1':-1} }, {'$limit': 700}]")
        .load();

dataset = dataset.limit(2);
Object rows = dataset.collect();
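
As a sanity check on the laziness described above, each call can be timed separately to see where the work actually happens. The rough sketch below is mine, not part of the original program: the class name is made up, only Spark's public API is used, and collectAsList() stands in for collect() purely to get a typed result:

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MongoReadTiming {
    public static void main(String[] args) {
        SparkSession session = SparkSession
                .builder()
                .master("local[*]")
                .appName("Mongo read timing")
                .config("spark.driver.extraJavaOptions", "-Xss1G")
                .getOrCreate();

        long t0 = System.nanoTime();
        // load(): the connector samples documents here to infer the schema
        Dataset<Row> dataset = session.read()
                .format("com.mongodb.spark.sql.DefaultSource")
                .option("uri", "mongodb://localhost:27017/test")
                .option("collection", "finished13")
                .option("pipeline", "[ { '$sort': {'prop1':-1} }, {'$limit': 700}]")
                .load();
        long t1 = System.nanoTime();

        // limit(): a lazy transformation, it only builds a new logical plan
        Dataset<Row> limited = dataset.limit(2);
        long t2 = System.nanoTime();

        // collectAsList(): the action that actually runs the job
        List<Row> rows = limited.collectAsList();
        long t3 = System.nanoTime();

        System.out.printf("load:    %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("limit:   %d ms%n", (t2 - t1) / 1_000_000);
        System.out.printf("collect: %d ms%n", (t3 - t2) / 1_000_000);
        System.out.println("rows collected: " + rows.size());

        session.stop();
    }
}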
In the Java version, dataset.limit(2) takes approximately 9 minutes to complete, even though it samples the very same collection. The collection consists of roughly 30k documents; each document has the same structure and contains around 27k properties (all at the first level, no nesting), with an average document size of 1.5 MB. Any idea why the Java version takes an eternity to complete?
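
One more thought: since each document has ~27k first-level properties, I am also wondering whether supplying an explicit schema would sidestep the sampling step entirely. The sketch below is only an idea, not something I have verified against the connector: the field names and types are placeholders, and I am assuming the connector respects a schema passed via DataFrameReader.schema() instead of inferring one:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MongoReadWithSchema {
    public static void main(String[] args) {
        SparkSession session = SparkSession
                .builder()
                .master("local[*]")
                .appName("Mongo read with explicit schema")
                .config("spark.driver.extraJavaOptions", "-Xss1G")
                .getOrCreate();

        // Placeholder schema: only the fields actually needed, with types
        // matching what is stored in MongoDB (names/types below are made up).
        StructType schema = new StructType()
                .add("prop1", DataTypes.DoubleType)
                .add("prop2", DataTypes.StringType);

        Dataset<Row> dataset = session.read()
                .format("com.mongodb.spark.sql.DefaultSource")
                .schema(schema) // assumption: a user-supplied schema skips the sampling
                .option("uri", "mongodb://localhost:27017/test")
                .option("collection", "finished13")
                .option("pipeline", "[ { '$sort': {'prop1':-1} }, {'$limit': 700}]")
                .load();

        dataset.limit(2).show();

        session.stop();
    }
}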