Current setup:
1. HDInsight for Spark
2. Azure Data Lake Store for HDFS storage
I am trying to load data into a Hive table from partitioned data located on Azure DLS. For writing to the partition location I am using this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql.types import *
from pyspark.sql.functions import *

sc = SparkContext(appName="Transformation")
sql_context = SQLContext(sc)

# allow dynamic partitioning when writing
sql_context.setConf("hive.exec.dynamic.partition", "true")
sql_context.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
sql_context.setConf("hive.exec.max.dynamic.partitions.pernode", "400")
sql_context.setConf("spark.sql.hive.convertMetastoreParquet", "false")

# current data in ORC format stored on ADLS
df = sql_context.read.orc("adl://LOCATION_NAME")

# writing the data, partitioned by ReportId and ReportDeliveryDate
df.write.format("parquet").partitionBy("ReportId", "ReportDeliveryDate").save("adl://LOCATION_NAME")
I can see the partition data at the location.
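(The layout on ADLS looks roughly like the sketch below; the ReportId and ReportDeliveryDate values and file names are illustrative, not my actual data.)

adl://LOCATION_NAME/ReportId=101/ReportDeliveryDate=2018-01-01/part-00000-<uuid>.snappy.parquet
adl://LOCATION_NAME/ReportId=101/ReportDeliveryDate=2018-01-02/part-00000-<uuid>.snappy.parquet
adl://LOCATION_NAME/ReportId=102/ReportDeliveryDate=2018-01-01/part-00000-<uuid>.snappy.parquet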
Now I am creating the table with the following command in Hive:
DROP TABLE IF EXISTS default.test_tb;
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_tb
(
  A STRING,
  B STRING
)
PARTITIONED BY (ReportId STRING, ReportDeliveryDate STRING)
STORED AS PARQUET
LOCATION 'adl://LOCATION_NAME';
and when I run the query select * from test_tb, there is no output.
I have tried many methods, like the one provided here. Not sure what I am doing wrong.
---- Update ----
The problem here turned out to be different. After a long discussion with the Azure team, it came down to the fact that Hive does not handle uppercase partition column names. Once I created the partitions in lowercase from Spark and then ran the repair table command (MSCK REPAIR TABLE), it worked.
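A minimal sketch of the working flow, keeping the placeholder path and table name from above (the lowercase renames are illustrative; the table DDL would likewise use reportid / reportdeliverydate as the partition columns):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="Transformation")
hive_context = HiveContext(sc)  # HiveContext so Spark can talk to the Hive metastore

# read the ORC source and rename the partition columns to lowercase
df = hive_context.read.orc("adl://LOCATION_NAME")
df = df.withColumnRenamed("ReportId", "reportid") \
       .withColumnRenamed("ReportDeliveryDate", "reportdeliverydate")

# write the Parquet data with lowercase partition directory names
df.write.format("parquet").partitionBy("reportid", "reportdeliverydate").save("adl://LOCATION_NAME")

# register the partition directories that already exist on ADLS with the metastore
hive_context.sql("MSCK REPAIR TABLE default.test_tb")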