I would like to transfer one big (over 150 mln records and 700 columns) table from one Hive database to another, that includes a few transformations like using one cast on a date column, substr on a string column and one simple case statement.
So, something like this:
-- initial settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.intermediate=true;
SET hive.exec.parallel=true;
SET parquet.compression=SNAPPY;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.merge.size.per.task=1000000000;
SET hive.merge.smallfiles.avgsize=1000000000;
INSERT INTO databaseA.tableName PARTITION(parition_col)
CASE WHEN a='Something' THEN 'SOMETHING'
WHEN a is null THEN 'Missing'
ELSE a END AS a,
column1,
column2,
...
cast(to_date(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) AS string) AS
run_date,
substr(some_string, 1, 3)
FROM databaseB.tableName;
The problem is that this query is going to take a lot of time (1 mln rows per hour). Maybe anybody knows how to speed it up?
I'm using map reduce engine for this task.
Thanks!