In my experience, the initial number of partitions depends on spark.default.parallelism.
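For reference, each scenario below was run with a different spark.default.parallelism set before the session was created. A minimal sketch of how that can be done (the app name and the value 10 are placeholders, not part of the original runs):

from pyspark.sql import SparkSession

# spark.default.parallelism must be set before the SparkContext is created;
# changing it on an already-running context has no effect.
spark = (SparkSession.builder
         .appName("partition-count-check")              # placeholder name
         .config("spark.default.parallelism", "10")     # value varied per scenario
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)            # prints 10 here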
Scenario 1 :
File Size : 75MB
defaultParallelism : 8
>>> sc.defaultParallelism
8
>>> booksDF = spark.read.option("inferSchema","true").option("header","true").csv("file:///C:\\Users\\Sandeep\\Desktop\\data\\Turkish_Book_Dataset_Kaggle_V2.csv")
>>> booksDF.rdd.getNumPartitions()
8
Scenario 2 :
File Size : 75MB
defaultParallelism : 10
>>> sc.defaultParallelism
10
>>> booksDF = spark.read.option("inferSchema","true").option("header","true").csv("file:///C:\\Users\\Sandeep\\Desktop\\data\\Turkish_Book_Dataset_Kaggle_V2.csv")
>>> booksDF.rdd.getNumPartitions()
10
Scenario 3 :
File Size : 75MB
defaultParallelism : 4
>>> sc.defaultParallelism
4
>>> booksDF = spark.read.option("inferSchema","true").option("header","true").csv("file:///C:\\Users\\Sandeep\\Desktop\\data\\Turkish_Book_Dataset_Kaggle_V2.csv")
>>> booksDF.rdd.getNumPartitions()
4
Scenario 4 :
File Size : 75MB
defaultParallelism : 100
>>> sc.defaultParallelism
100
>>> booksDF = spark.read.option("inferSchema","true").option("header","true").csv("file:///C:\\Users\\Sandeep\\Desktop\\data\\Turkish_Book_Dataset_Kaggle_V2.csv")
>>> booksDF.rdd.getNumPartitions()
18
In scenario 4, Spark divided the data into the maximum feasible number of partitions, i.e. 18.
Based on this I am inferring that the initial number of partitions depends on the value of spark.default.parallelism.
And if spark.default.parallelism is set to a higher number, Spark only creates as many partitions as the file can actually be split into (driven by the split size it computes from the file size, the per-file open cost, and the parallelism), not the full requested number.
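As a rough sanity check of that last point, the count in scenario 4 matches what Spark's file-source split sizing would predict. The sketch below mirrors that formula, assuming the default spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB); it is an approximation for a single ~75 MB file, not Spark's exact code:

import math

file_size     = 75 * 1024 * 1024           # ~75 MB input file
max_partition = 128 * 1024 * 1024          # spark.sql.files.maxPartitionBytes (default)
open_cost     = 4 * 1024 * 1024            # spark.sql.files.openCostInBytes (default)
parallelism   = 100                        # spark.default.parallelism in scenario 4

bytes_per_core  = (file_size + open_cost) / parallelism
max_split_bytes = min(max_partition, max(open_cost, bytes_per_core))

print(math.ceil(file_size / max_split_bytes))   # ~19, close to the 18 observed above

The same calculation also lines up with scenarios 1-3: with a small parallelism, bytes_per_core is larger than the 4 MB open cost, so the split size is roughly file_size / parallelism and the partition count lands on the parallelism value itself.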