I am writing data out in parquet format using peopleDF.write.parquet("people.parquet") in my PySpark code. Now, from the same code, I want to create a table on top of this parquet file which I can later query. How can I do that?

user2966197
- What table? You can just load that parquet into a DataFrame, register it as a temporary table, and run your query using Spark SQL. Or tell us how you want to run your query. – iurii_n Apr 11 '17 at 15:22
- @YuriyNedostup What I want is to create a Hive table based on the parquet file that I wrote. I don't want a temporary table. – user2966197 Apr 11 '17 at 15:24
- Are your parquet files stored in HDFS? – iurii_n Apr 11 '17 at 15:26
- @iurii_n Yes, the parquet files are in HDFS. – user2966197 Apr 11 '17 at 15:29
2 Answers
You can use the saveAsTable method:
peopleDF.write.saveAsTable('people_table')
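If you want the table to sit on top of a specific parquet location rather than have the data copied into the Hive warehouse, saveAsTable can also be pointed at a path. This is only a sketch, assuming a Spark build with Hive support; the table name and path are illustrative:

# Sketch: passing a "path" option makes saveAsTable create the table over
# that location (an external-style table) instead of copying the data
# into the warehouse directory. Table name and path are illustrative.
peopleDF.write \
    .format("parquet") \
    .option("path", "/path/to/people.parquet") \
    .saveAsTable("people_table")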

Spandan Brahmbhatt
- But that's saving the DataFrame as a table, not creating the table on top of the parquet file – user2966197 Apr 11 '17 at 15:16
You have to create an external table in Hive, like this:
CREATE EXTERNAL TABLE my_table (
  col1 INT,
  col2 INT
) STORED AS PARQUET
LOCATION '/path/to/';
where /path/to/ is the absolute path to the files in HDFS.
If you want to use partitioning you can add PARTITIONED BY (col3 INT). In that case, to see the data you have to run MSCK REPAIR TABLE after partitions are added.
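To run that DDL (and the repair) from the same PySpark code, something along these lines should work. A sketch only, assuming a HiveContext is available (Spark 1.x style, which matches the era of this question); the table, column, and path names are the illustrative ones from above:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)

# Create the external table over the existing parquet directory
# (partitioned variant; names are illustrative).
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
        col1 INT,
        col2 INT
    )
    PARTITIONED BY (col3 INT)
    STORED AS PARQUET
    LOCATION '/path/to/'
""")

# Register any partition directories that already exist on disk.
sqlContext.sql("MSCK REPAIR TABLE my_table")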

iurii_n
- You don't have to. Just make sure the files are in the directory. Each time new data is added you have to execute repair and invalidate metadata to see the changes. You can query your table through Hive on the command line or by using a tool like SQL Workbench – iurii_n Apr 11 '17 at 15:38
- Ah, you can try this: http://stackoverflow.com/questions/36051091/query-hive-table-in-pyspark. Once the table is created you can load it using HiveContext – iurii_n Apr 11 '17 at 15:41
- The reason I want to do it through PySpark code is that I want to automate the create-table step so that I don't have to run the CREATE TABLE command manually. As soon as the parquet file is created, the code will then create the table on top of that parquet file – user2966197 Apr 11 '17 at 15:41
- Well, you might have your own reasons for that :) I wouldn't do that. Probably PySpark has some functionality to do so, but I don't know it. The one solution I can see is to extract the schema from the DataFrame, convert it to a SQL statement, and run it using HiveContext's sql function (see the sketch after these comments). – iurii_n Apr 11 '17 at 15:48
- My parquet files location has gz.parquet files along with a _metadata file. When I execute the above command it does create the table, but no records get entered into the table – user2966197 Apr 11 '17 at 16:25
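The schema-to-SQL idea from iurii_n's comment above can be sketched as follows. This is illustrative, not from the original thread: create_external_table is a hypothetical helper, and the type mapping relies on dataType.simpleString(), which matches Hive type names for common primitives but is not a complete mapping:

def create_external_table(df, table_name, hdfs_path, sqlContext):
    # Build a column list from the DataFrame schema, e.g. "col1 int, col2 int".
    cols = ", ".join(
        "{0} {1}".format(f.name, f.dataType.simpleString())
        for f in df.schema.fields
    )
    ddl = (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {0} ({1}) "
        "STORED AS PARQUET LOCATION '{2}'"
    ).format(table_name, cols, hdfs_path)
    sqlContext.sql(ddl)

# Run right after the parquet write, so no manual step is needed:
peopleDF.write.parquet("/path/to/people.parquet")
create_external_table(peopleDF, "people_table", "/path/to/people.parquet", sqlContext)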