I have a Hive table X that is backed by multiple files on HDFS. The table's location on HDFS is /data/hive/X, and the files are:
/data/hive/X/f1
/data/hive/X/f2
/data/hive/X/f3 ...
Now I run the following commands:
df = hiveContext.sql("SELECT count(*) FROM X")
df.show()
What happens internally? Is each file treated as a separate partition and processed by a separate node, with the results then collated?
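For reference, this is how I would try to observe the partition count myself (a minimal sketch; I'm assuming hiveContext.table("X") produces the same underlying scan as the SQL query, and that getNumPartitions reflects the input splits):

df = hiveContext.table("X")        # DataFrame backed by the files under /data/hive/X
print(df.rdd.getNumPartitions())   # number of partitions Spark created for the scan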
If so, is there a way to instruct Spark to load all of the files into a single partition and then process the data? Something like the sketch below is what I have in mind.
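A minimal sketch of what I mean, assuming DataFrame.coalesce(1) is the right mechanism for merging the input partitions:

df = hiveContext.table("X").coalesce(1)   # collapse all input partitions into one
print(df.count())                         # a single task would then scan every file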
Thanks in advance.