I am writing data out in parquet format using peopleDF.write.parquet("people.parquet") in my PySpark code. Now, from the same code, I want to create a table on top of this parquet file which I can later query. How can I do that?

user2966197
- What table? You can just load that parquet into a DataFrame, register it as a temporary table, and run your query using Spark SQL. Or tell us how you want to run your query. – iurii_n Apr 11 '17 at 15:22
- @YuriyNedostup What I want is to create a Hive table based on the parquet file that I wrote. I don't want a temporary table. – user2966197 Apr 11 '17 at 15:24
- Are your parquet files stored in HDFS? – iurii_n Apr 11 '17 at 15:26
- @iurii_n Yes, the parquet files are in HDFS. – user2966197 Apr 11 '17 at 15:29
2 Answers
You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
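If you want the table to sit on top of a specific parquet location rather than have the data copied into the Hive warehouse, saveAsTable can also be pointed at a path. This is only a sketch, assuming a Spark build with Hive support; the table name and path are illustrative:

# Sketch: passing a "path" option makes saveAsTable create the table over
# that location (an external-style table) instead of copying the data
# into the warehouse directory. Table name and path are illustrative.
peopleDF.write \
    .format("parquet") \
    .option("path", "/path/to/people.parquet") \
    .saveAsTable("people_table")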

Spandan Brahmbhatt
- But that's saving the DataFrame as a table, not creating the table on top of the parquet file – user2966197 Apr 11 '17 at 15:16
You have to create an external table in Hive, like this:
CREATE EXTERNAL TABLE my_table (
  col1 INT,
  col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning you can add PARTITIONED BY (col3 INT). In that case, to see the data you have to run MSCK REPAIR TABLE after partitions are added.
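To run that DDL (and the repair) from the same PySpark code, something along these lines should work. A sketch only, assuming a HiveContext is available (Spark 1.x style, which matches the era of this question); the table, column, and path names are the illustrative ones from above:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Create the external table over the existing parquet directory
# (partitioned variant; names are illustrative).
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
        col1 INT,
        col2 INT
    )
    PARTITIONED BY (col3 INT)
    STORED AS PARQUET
    LOCATION '/path/to/'
""")

# Register any partition directories that already exist on disk.
sqlContext.sql("MSCK REPAIR TABLE my_table")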

iurii_n
- You don't have to. Just make sure the files are in the directory. Each time new data is added you have to execute repair and invalidate metadata to see the changes. You can query your table through Hive on the command line or by using a tool like SQL Workbench – iurii_n Apr 11 '17 at 15:38
- Ah, you can try this: http://stackoverflow.com/questions/36051091/query-hive-table-in-pyspark. Once the table is created you can load it using HiveContext – iurii_n Apr 11 '17 at 15:41
- The reason I want to do it through PySpark code is that I want to automate the create-table step so that I don't have to run the CREATE TABLE command manually. As soon as the parquet file is created, the code will then create the table on top of that parquet file – user2966197 Apr 11 '17 at 15:41
- Well, you might have your own reasons for that :) I wouldn't do that. Probably PySpark has some functionality to do so, but I don't know it. The one solution I can see is to extract the schema from the DataFrame, convert it to a SQL statement, and run it using HiveContext's sql function (see the sketch after these comments). – iurii_n Apr 11 '17 at 15:48
- My parquet files location has gz.parquet files along with a _metadata file. When I execute the above command it does create the table, but no records get entered into the table – user2966197 Apr 11 '17 at 16:25
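The schema-to-SQL idea from iurii_n's comment above can be sketched as follows. This is illustrative, not from the original thread: create_external_table is a hypothetical helper, and the type mapping relies on dataType.simpleString(), which matches Hive type names for common primitives but is not a complete mapping:

def create_external_table(df, table_name, hdfs_path, sqlContext):
    # Build a column list from the DataFrame schema, e.g. "col1 int, col2 int".
    cols = ", ".join(
        "{0} {1}".format(f.name, f.dataType.simpleString())
        for f in df.schema.fields
    )
    ddl = (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {0} ({1}) "
        "STORED AS PARQUET LOCATION '{2}'"
    ).format(table_name, cols, hdfs_path)
    sqlContext.sql(ddl)

# Run right after the parquet write, so no manual step is needed:
peopleDF.write.parquet("/path/to/people.parquet")
create_external_table(peopleDF, "people_table", "/path/to/people.parquet", sqlContext)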