
I have a Hive table X that is backed by multiple files on HDFS. The table's location on HDFS is /data/hive/X. Files:

/data/hive/X/f1
/data/hive/X/f2
/data/hive/X/f3 ...

Now, I run the following commands:

df=hiveContext.sql("SELECT count(*) from X")
df.show()

What happens internally? Is each file considered a separate partition, processed by a separate node, with the results then collated?

If yes, is there a way to instruct Spark to load all the files into one partition and then process the data?

Thanks in advance.

user3031097

1 Answer


Spark will contact the Hive metastore to find out (a) the location of the data and (b) how to read it. At a low level, Spark computes input splits based on the InputFormat Hive used to store the data. Once the splits are decided, Spark reads the data as one split per partition. In Spark, one physical node can run one or more executors, and each executor processes one or more partitions. Once the data is read into memory, Spark runs the count in two stages: (a) local counts on the map side, then (b) a global count after a shuffle. The result is then returned to the driver.
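As a minimal sketch of the above (assuming Spark 1.x with the hiveContext from the question), you can inspect how many partitions the table scan produces, and mimic the two-stage count at the RDD level:

# Read table X directly; the number of partitions reflects the input
# splits computed from the files under /data/hive/X
src = hiveContext.table("X")
print(src.rdd.getNumPartitions())

# Conceptually, count(*) is a local count per partition (map side)
# followed by a global sum after the shuffle
local_counts = src.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)])
print(local_counts.sum())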

Ayan Guha
  • This is correct. But if it is on HDFS, a single file can become multiple partitions. This happens only when Hadoop is able to split the file: that is simple for line-based formats such as CSV/TSV, but becomes more complex when compression is also used (http://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2). These splits are done at the HDFS block size level, so if you have a file of 300 MB and the HDFS block size is set at 128 MB, you get 3 blocks of 128 MB, 128 MB and 44 MB respectively. – Fokko Driesprong Sep 14 '16 at 06:20
  • Thanks Ayan and Fokko. The files I have are small, so it will surely be one file per partition. Is there any way we can tell Spark to repartition all files into one partition? – user3031097 Sep 14 '16 at 17:09
  • rdd.repartition(1) – Ayan Guha Sep 15 '16 at 00:21
  • To Fokko's point, HDFS understands files and blocks only, not partitions. Let's be consistent with terminology to keep life simpler :) Secondly, anything you put onto HDFS (even compressed files) is chunked up based on the block size (as Fokko mentioned). – Ayan Guha Sep 15 '16 at 00:24
  • Thanks Ayan. However, I am using DataFrames and generating the DataFrame by running a query. So how can I use the rdd.repartition(1) option in this case? – user3031097 Sep 15 '16 at 22:10
  • Try to set `SET spark.sql.shuffle.partitions = 1` before executing the query. – Fokko Driesprong Sep 17 '16 at 21:16
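Putting the two comment suggestions together for DataFrames, a minimal sketch (assuming Spark 1.4+, where DataFrame.coalesce is available):

# Option 1: have any post-shuffle stage produce a single partition
hiveContext.setConf("spark.sql.shuffle.partitions", "1")

# Option 2: explicitly collapse the query result into one partition;
# coalesce(1) avoids a full shuffle, repartition(1) forces one
df = hiveContext.sql("SELECT * FROM X")
single = df.coalesce(1)
print(single.rdd.getNumPartitions())  # 1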