
I would like to transfer one big table (over 150 million records and 700 columns) from one Hive database to another, applying a few transformations along the way: a cast on a date column, a substr on a string column, and one simple CASE statement.

So, something like this:

-- initial settings 
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true; 
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;

INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
    CASE WHEN a = 'Something' THEN 'SOMETHING'
         WHEN a IS NULL THEN 'Missing'
         ELSE a
    END AS a,
    column1,
    column2,
    ...
    cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS run_date,
    substr(some_string, 1, 3),
    partition_col  -- dynamic partition column goes last in the SELECT list
FROM databaseB.tableName;

The problem is that this query takes a lot of time (about 1 million rows per hour). Does anybody know how to speed it up?

I'm using the MapReduce engine for this task.

Thanks!

  • Try inserting the result into a temp table and then inserting it into the main table. – Vamsi Prabhala Dec 18 '17 at 21:50
  • Thank you @Vamsi. Now is better. Any other ideas for improvement? – max04 Dec 19 '17 at 10:50
  • In Hive there is no performance difference between loading data into a different database or into the same database from which you are reading the data. A Hive database usually refers to an HDFS directory. In terms of SQL I can't see how you can tune this; performance will be decided by the size of your HDFS cluster. Have you tried running this, and what are the metrics? – Ramesh Dec 19 '17 at 13:58
  • The most important thing here is that this process ends up with one big file (20 GB) instead of a few smaller ones. Also, the reduce phase is taking most of the time, so maybe there is something wrong with the settings above? Why did the data end up as one big file? There are a lot of mappers and reducers during this process, and reduce = 99% takes a few hours. – max04 Dec 19 '17 at 16:43

2 Answers


Since all the data in Hive tables is stored as files on HDFS, why not move/copy the files directly into the new table's HDFS location?

Example:

Assuming the table you want to copy is already present in db1 as table_to_cpy:

create database db2;
create table db2.table_to_cpy like db1.table_to_cpy;
desc formatted db1.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db1.db/table_to_cpy

desc formatted db2.table_to_cpy;
--copy the hdfs table path ---> /user/hive/warehouse/db2.db/table_to_cpy

hadoop fs -cp /user/hive/warehouse/db1.db/table_to_cpy/* /user/hive/warehouse/db2.db/table_to_cpy/.
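One caveat worth adding (standard Hive behavior, not part of the answer above): since the table in the question is partitioned, the metastore will not know about partition directories copied at the file level, so the data will not show up in queries until the partitions are registered:

-- register the copied partition directories in the metastore
MSCK REPAIR TABLE db2.table_to_cpy;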
sk7979

A few suggestions on how to speed up your query:

  1. Avoid unix_timestamp() if possible. This function is non-deterministic and prevents proper optimization of queries; it will be executed in each mapper or reducer and may return different values. Use instead:

    current_date() AS run_date

    See also this answer for more details: https://stackoverflow.com/a/41140298/2700344

  2. Tune mapper and reducer parallelism. If your process ends up with one big file (20 GB) instead of a few smaller ones, then there is obviously not enough parallelism.

For mappers, play with these settings:

set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB

Decrease hive.exec.reducers.bytes.per.reducer to increase the number of reducers.
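For example (a minimal sketch; the 64 MB figure below is an illustrative starting point, not a value from the original answer):

-- fewer bytes per reducer means more reducers (the default is 256 MB in recent Hive versions)
set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB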

Play with these settings; the success criterion is getting more mappers/reducers, with your map and reduce stages running faster.

See this answer for details: https://stackoverflow.com/a/42842117/2700344

  3. Try adding distribute by partition_col. It will distribute data between reducers according to the partition keys, so each reducer will create fewer partitions and consume less memory. It also helps to avoid too many small output files. It should be used together with hive.exec.reducers.bytes.per.reducer to avoid problems with uneven distribution between reducers and to avoid too-big output files.
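Put together, a minimal sketch of the rewritten INSERT (column names carried over from the question; this illustrates the suggestions above, not the asker's exact query):

INSERT INTO databaseA.tableName PARTITION(partition_col)
SELECT
    CASE WHEN a = 'Something' THEN 'SOMETHING'
         WHEN a IS NULL THEN 'Missing'
         ELSE a
    END AS a,
    ...
    current_date() AS run_date,   -- suggestion 1: deterministic, computed once
    substr(some_string, 1, 3),
    partition_col
FROM databaseB.tableName
DISTRIBUTE BY partition_col;      -- suggestion 3: rows with the same partition key go to the same reducer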
leftjoin