
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.

Ex:

from pyspark.sql import SparkSession

warehouse_location = '/user/hive/warehouse'
spark = SparkSession.builder.appName("Pyspark") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport().getOrCreate()

DF = spark.sql("select * from hive_table")

In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?

I am just wondering how the SQL is being processed: in Hive or in Spark?

Harish

2 Answers


enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.

In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.

Hive support does not imply:

  • Full Hive Query Language compatibility.
  • Any form of computation on Hive.
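One way to see this for yourself is to inspect the query plan Spark produces. A minimal sketch, assuming the session from the question and that hive_table exists in the metastore:

DF = spark.sql("select * from hive_table")
# Prints Spark's logical and physical plans; the physical plan shows a Spark
# scan of the table (e.g. HiveTableScan/FileScan), not a Hive MapReduce job.
DF.explain(True)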
Alper t. Turker

SparkSQL allows reading data from and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
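As a small sketch of the RDD-to-DataFrame path mentioned above (the column names and sample rows are made up for illustration):

from pyspark.sql import Row

# Build a DataFrame from a plain RDD, register it as a temp view, and query it with SparkSQL.
rdd = spark.sparkContext.parallelize([Row(id=1, name="a"), Row(id=2, name="b")])
people = spark.createDataFrame(rdd)
people.createOrReplaceTempView("people")
spark.sql("select name from people where id = 1").show()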

The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.
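A minimal sketch of that check, assuming the session from the question and that hive_table exists:

DF = spark.sql("select * from hive_table")
DF.count()   # triggers a Spark job; no Hive/MapReduce job is launched
# While it runs, the stages of this count show up in the Spark UI at
# http://localhost:4040, confirming the work is done by the Spark engine.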

Jagrut Sharma
  • Thanks. Are you saying that, though it uses the Hive table, the query execution (select * from) happens in Spark? Meaning Spark will directly read the underlying table files from the file system? – Harish May 07 '18 at 12:44
  • Yes, correct. `SparkSQL` will leverage the `Hive` metastore to access metadata for the `Hive` tables. Then, the work of reading the table files from disk, processing them, and running the query is all done via the `Spark` engine. – Jagrut Sharma May 07 '18 at 14:46
  • Thank you for explaining it. In general, does SparkSQL mean executing SQL queries in Spark, like in the above example? – Harish May 08 '18 at 06:35
  • 1
    `SparkSQL` allows executing `SQL` queries on existing `Hive` tables via `spark.sql("query")`. This is great since you can improve performance, while using the `Hive` setup in existing `Hadoop` cluster. `SparkSQL` can also be used via the `DataSet API`. Here, you construct a `Dataset/DataFrame` from existing `RDD`/data file (e.g. `JSON`, `Parquet`), and then use transformations like `filter()`, `groupBy()`, `map()` on it. A `DataFrame` provides tabular view of data, and since it has a schema associated with it, `SparkSQL` can process it more efficiently than a plain `RDD`. – Jagrut Sharma May 08 '18 at 07:56
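To illustrate the DataFrame/Dataset-style usage described in the last comment, here is a small sketch (the file path and column names are hypothetical):

# Read a Parquet file into a DataFrame and use transformations instead of raw SQL text.
df = spark.read.parquet("/path/to/data.parquet")   # hypothetical path
result = (df.filter(df["amount"] > 100)            # hypothetical columns
            .groupBy("category")
            .count())
result.show()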