26

I know we can load a Parquet file using Spark SQL and using Impala, but I'm wondering if we can do the same using Hive. I have been reading many articles, but I am still confused.

Simply put, I have a Parquet file - say users.parquet. Now I am stuck on how to load/insert/import the data from users.parquet into Hive (obviously into a table).

Please advise or point me in the right direction if I am missing something obvious.

Creating hive table using parquet file metadata

https://phdata.io/examples-using-textfile-and-parquet-with-hive-and-impala/

annunarcist
    A trick I have done with ORC files (to clone a Prod table into a Test cluster, actually): create a non-partitioned table with the exact same structure; copy the data file(s) to the directory used by the new table; *voila*, the table is populated. Might work with Parquet too. – Samson Scharfrichter Dec 16 '15 at 10:58
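
A minimal sketch of that clone trick for Parquet (the clone table name, the two columns and the warehouse paths are made up; use your source table's exact structure and locations):

    # create an empty table with the exact same structure as the source, stored as Parquet
    hive -e "CREATE TABLE users_clone (name STRING, age INT) STORED AS PARQUET;"
    # drop the source table's data files into the new table's directory
    hdfs dfs -cp /apps/hive/warehouse/users/* /apps/hive/warehouse/users_clone/
    # the clone should now return the copied rows
    hive -e "SELECT COUNT(*) FROM users_clone;"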

4 Answers

14

Get the schema of the Parquet file using parquet-tools; for details check this link: http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html

and build a table using that schema on top of the file; for details check Create Hive table to read parquet files from parquet/avro schema
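
For example, a rough sketch of those two steps (the column names and the HDFS directory are placeholders, not taken from the question):

    # 1. print the Parquet schema (here against a local copy of the file)
    parquet-tools schema users.parquet
    # 2. create a Hive table on top of the HDFS directory that holds the file
    hive -e "CREATE EXTERNAL TABLE users (name STRING, age INT)
             STORED AS PARQUET
             LOCATION 'hdfs:///data/users';"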

Ram Manohar
11

Getting the schema is crucial, as you will have to create the table with the appropriate schema first in Hive and then point it to the parquet files.

I had a similar problem, where I had data in one VM and had to move it to another. Here is my walkthrough:

  1. Find out where the original Parquet files are located and what schema they have: describe formatted users; and show create table users; The latter will give you the schema right away and also point you to the HDFS location, e.g. hdfs://hostname:port/apps/hive/warehouse/users

  2. Find out about the partitioning of your table: show partitions users;

  3. Copy the table's Parquet files from HDFS to a local directory

    hdfs dfs -copyToLocal /apps/hive/warehouse/users
    
  4. Move them across to the other cluster/VM or where you want them to go

  5. Create the users table on your destination by using the same schema

    CREATE TABLE users ( name string, ... )
    PARTITIONED BY (...)
    STORED AS PARQUET;
    
  6. Now move the Parquet files into the respective folder (if needed, find out the location of the table you've just created)

    hdfs dfs -ls /apps/hive/warehouse/users/
    hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/users/
    
  7. For each partition, you'll have to point Hive to the respective subdirectory: alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231'; (you might want to do this with a bash script; a rough sketch follows)
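
A rough sketch of such a script, reusing the table name, partition column and warehouse path from the example above (adjust them to your setup):

    # register every sign_up_date=... subdirectory as a partition of the users table
    for dir in $(hdfs dfs -ls /apps/hive/warehouse/users/ | grep -o 'sign_up_date=[0-9]*'); do
      value=${dir#sign_up_date=}
      hive -e "ALTER TABLE users ADD IF NOT EXISTS PARTITION (sign_up_date='${value}')
               LOCATION '/apps/hive/warehouse/users/${dir}';"
    done

If the subdirectories already follow Hive's partition naming convention, a single msck repair table users; should achieve the same result.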

That worked for me, hope it helps.

Hendrik F
6

I don't know if it's a bit "hacky", but I use Zeppelin (shipped with Ambari). You can simply do the following in combination with Spark 2:

%spark2
import org.apache.spark.sql.SaveMode

var df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet");
df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")

The advantage of this approach is that you can also import many Parquet files even if they have different schemas.
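
If you don't have Zeppelin at hand, roughly the same thing can be piped into spark-shell from the command line (a sketch; it assumes a Spark build with Hive support and reuses the made-up path and table name from above):

spark-shell --conf spark.sql.catalogImplementation=hive <<'EOF'
import org.apache.spark.sql.SaveMode
val df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet")
df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")
EOF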

Fabian
  • I believe the above way creates a Hive managed table instead of an external table. – soMuchToLearnAndShare Aug 26 '19 at 09:07
  • Could you show us the complete code? I have no idea where to set up the Hive connection in this code. – yuzhen Jun 09 '20 at 02:13
  • My answer is from 2017, and a lot may have changed in Spark since then, but if I remember correctly you don't need to set up a connection if you use Zeppelin like I wrote (you can use the interpreter settings etc.). If you don't use Zeppelin, please search for how to set up a Hive connection, because I'm not really up to date with the latest Spark/Hive versions. – Fabian Jun 09 '20 at 12:23
0

You can try this... Export/import works for all file formats in Hive, including Parquet. This is the general concept; you can tweak it a little based on your requirements, e.g. loading from local or across clusters.

Note: You can hard-code values instead of the $ variables when you execute the individual steps, and pass the "HDFS path", "Schema" and "tablename" as parameters when you run it from a script. That way you can export/import any number of tables just by passing the parameters. A hard-coded sketch follows the steps below.

  • Step1: hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';" # -- Exports the table to the given HDFS path.
  • Step2: # -- The exported directory contains both data and metadata. Zip it and scp it to the target cluster.
  • Step3: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The first import will throw an error because the table doesn't exist, but it will automatically create the table.
  • Step4: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The second import will import the data without any error, as the table is now available.
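
A hard-coded version of those steps might look like this (the database, table and export path are examples only):

    # step 1: export data + metadata to an HDFS path
    hive -S -e "export table mydb.users to '/tmp/exports/users';"
    # step 2: pull the export out of HDFS and copy it to the target cluster
    hdfs dfs -get /tmp/exports/users . && scp -r users target-host:/tmp/users
    # (on the target cluster, put the files back into HDFS under /tmp/exports/users)
    # steps 3 and 4: run the import twice, as described above
    hive -S -e "import table mydb.users from '/tmp/exports/users';"
    hive -S -e "import table mydb.users from '/tmp/exports/users';"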

Thanks

Kumar

saranvisa