
I am joining 3 huge tables (billion-row tables) in Hive. All the statistics have been collected, but the performance is still very bad (the query takes around 40 minutes).

Is there any parameter I can set at the Hive prompt to get better performance?
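
(By "parameter" I mean session-level switches along these lines; this is only a sketch of commonly cited join-tuning settings, not ones I have verified on this cluster:)

set hive.auto.convert.join=true;          -- allow map-side joins when a table fits in memory
set hive.exec.parallel=true;              -- run independent stages of the plan in parallel
set hive.exec.compress.intermediate=true; -- compress data passed between MR stages
set hive.optimize.skewjoin=true;          -- split heavily skewed join keys into a separate job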

While the query executes I see log messages like:

Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
Sep 4, 2015 7:40:23 AM INFO: parquet.hadoop.ParquetFileReader: reading another 1 footers

All the tables were created in Big SQL with the storage clause "STORED AS PARQUETFILE".

Also, how can I suppress the job progress details while a Hive query is running?
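
(I am aware of the CLI's silent mode and of raising the console log threshold, sketched below with a placeholder file name, but the Parquet INFO lines above seem to come from the Parquet reader's own logger:)

hive -S -f query.hql                          # -S (silent mode) suppresses progress output
hive --hiveconf hive.root.logger=WARN,console # raise the console log threshold for the session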

Regarding the Hive version:

hive> set system:sun.java.command;
system:sun.java.command=org.apache.hadoop.util.RunJar /opt/ibm/biginsights/hive/lib/hive-cli-0.12.0.jar org.apache.hadoop.hive.cli.CliDriver -hiveconf hive.aux.jars.path=file:///opt/ibm/biginsights/hive/lib/hive-hbase-handler-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hive-contrib-0.12.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-client-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-common-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-hadoop2-compat-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-prefix-tree-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-protocol-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/hbase-server-0.96.0.jar,file:///opt/ibm/biginsights/hive/lib/htrace-core-2.01.jar,file:///opt/ibm/biginsights/hive/lib/zookeeper-3.4.5.jar,file:///opt/ibm/biginsights/sheets/libext/piggybank.jar,file:///opt/ibm/biginsights/sheets/libext/pig-0.11.1.jar,file:///opt/ibm/biginsights/sheets/libext/avro-1.7.4.jar,file:///opt/ibm/biginsights/sheets/libext/opencsv-1.8.jar,file:///opt/ibm/biginsights/sheets/libext/json-simple-1.1.jar,file:///opt/ibm/biginsights/sheets/libext/joda-time-1.6.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets.jar,file:///opt/ibm/biginsights/sheets/libext/bigsheets-serdes-1.0.0.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-column-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-common-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-encoding-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-generator-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hadoop-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-hive-bundle-1.3.2.jar,file:///opt/ibm/biginsights/lib/parquet/parquet-mr/parquet-thrift-1.3.2.jar,file:///opt/ibm/biginsights/hive/lib/guava-11.0.2.jar
  • If you join 100% of the records in all 3 tables, then statistics will not make much difference. And 40 min for a massive Hive query is not that surprising. Some points of interest may be: MapReduce or Tez? What kind(s) of join *(cf. EXPLAIN; a minimal sketch follows this thread)*? How many mappers for each join step *(cf. MR or Tez logs)*? Lots of data skew *(cf. some mappers chewing through a lot more data than others and slowing down the step)*? Compression on intermediate results? Etc. – Samson Scharfrichter Sep 04 '15 at 17:45
  • When I use ORC as the storage format instead of Parquet, I get a very good performance improvement. Could you please tell me why that is? – Koushik Chandra Sep 06 '15 at 21:28
  • Why did you choose Hive+Parquet? Because of Cloudera marketing hype? See my comment on that other post (and the post BTW) http://stackoverflow.com/questions/32350664/why-is-parquet-slower-for-me-against-text-file-format-in-hive#comment52629679_32350664 – Samson Scharfrichter Sep 07 '15 at 09:27
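
(Following up on the EXPLAIN suggestion in this thread, a minimal sketch of how to inspect the chosen join strategy; the query shape and table names below are placeholders, not the actual tables:)

EXPLAIN
SELECT t1.k, t2.v, t3.w
FROM t1
JOIN t2 ON t1.k = t2.k
JOIN t3 ON t1.k = t3.k;
-- in the printed plan, "Map Join Operator" indicates a map-side join,
-- while "Join Operator" indicates a reduce-side (common) join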

1 Answer


Koushik - this question I asked a month back will give you good insight into the performance of ORC vs. Parquet.

Let me ask this: what is the structure of your data? Is it nested or flat? If it is flat data (for example, data ingested from an RDBMS), ORC is the better choice, since it stores lightweight indexes alongside the data, which makes retrieval faster.
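
To illustrate, a minimal sketch of an ORC table declaration with its lightweight index enabled (the column names and codec below are placeholders, not from your schema):

CREATE TABLE fact_orc (
  id      BIGINT,
  dim_key INT,
  amount  DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",      -- built-in compression (SNAPPY is also common)
  "orc.create.index"="true"   -- store lightweight min/max indexes with the data
);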

Hope this helps

– Rahul
  • The data has a flat structure. The original tables were actually created from Big SQL and are in Parquet format. There are 18 such tables being joined in total (5 of them are huge). If I keep the Parquet format and break the 18-table join into a multi-step process, it takes around 6 hours in total to populate a de-normalized flat table. – Koushik Chandra Sep 18 '15 at 12:39
  • If I have to use the ORC format, then I first have to create ORC versions of all 18 tables and populate them, and only then can I use those ORC tables in my multi-step join process. So the additional task is to populate the ORC tables from Parquet first (e.g., one CTAS per table, as sketched below). I don't think the total time will come down much compared to what it is taking now, considering that additional step. – Koushik Chandra Sep 18 '15 at 12:41
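
(For reference, each conversion I mean would be a one-off CTAS per table; the table names below are placeholders:)

CREATE TABLE big_table_orc STORED AS ORC
AS SELECT * FROM big_table_parquet;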