I have many small tables (less than 5k bytes when exported as CSV) that are only "from to" (e.g. code to name), and must be used in a JOIN, just to translate internal codes or IDs... How to use CREATE TABLE
with them on Hive?
Example:
CREATE TABLE mydb.fromto1(id1 bigint, name1 string);
CREATE TABLE mydb.fromto2(
id2 bigint,
name2 varchar(10)
)
PARTITIONED BY (ingestion_day date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://TLVBRPRDK/apps/hive/warehouse/mydb.db/fromto2'
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.row.index.stride'='50000',
'orc.stripe.size'='67108864',
'transient_lastDdlTime'='1577456923'
);
-- INSERT INTO mydb.fromto1 10 lines
-- INSERT INTO mydb.fromto2 10 lines
CREATE VIEW mydb.vw_test1 AS -- need for BEST PERFORMANCE HERE!
SELECT big.*, tiny.name1
FROM mydb.big_fact_table big INNER JOIN mydb.fromto1 tiny ON big.id1=tiny.id1
-- and/or INNER JOIN mydb.fromto2 tiny2 ON big.id2=tiny2.id2
;
How to set correc parameters (partitioned or not, compressed or not, managed or external, row format, etc.) for best performance in a SQL JOIN with Big Data (fact) tables?
Is there a "good Fast Guide" or Wizard for it?
NOTES:
- This question/ansewer is not the same. Perhaps there are clues in optiomizations for "Hive Star-schema JOINs", but not here.
- There are some clues here on cwiki.Apache/Hive/LanguageManual+JoinOptimization, but it is not about CREATE TABLE.