
I have many small tables (less than 5 KB each when exported as CSV) that are only "from-to" mappings (e.g. code to name) and must be used in JOINs just to translate internal codes or IDs. How should I use CREATE TABLE for them in Hive?

Example:

  CREATE TABLE mydb.fromto1(id1 bigint, name1 string);

  CREATE TABLE mydb.fromto2(
     id2 bigint,
     name2 varchar(10)
  )
  PARTITIONED BY (ingestion_day date)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  LOCATION 'hdfs://TLVBRPRDK/apps/hive/warehouse/mydb.db/fromto2'
  TBLPROPERTIES (
     'orc.compress'='SNAPPY', 
     'orc.row.index.stride'='50000',
     'orc.stripe.size'='67108864',
     'transient_lastDdlTime'='1577456923'
  );

  -- INSERT INTO mydb.fromto1 10 lines
  -- INSERT INTO mydb.fromto2 10 lines

  CREATE VIEW mydb.vw_test1 AS -- need for BEST PERFORMANCE HERE!
    SELECT big.*, tiny.name1 
    FROM mydb.big_fact_table big INNER JOIN mydb.fromto1 tiny ON big.id1=tiny.id1
     -- and/or INNER JOIN mydb.fromto2 tiny2 ON big.id2=tiny2.id2
  ;

How do I set the correct parameters (partitioned or not, compressed or not, managed or external, row format, etc.) for the best performance in a SQL JOIN with big (fact) tables?

Is there a "good Fast Guide" or Wizard for it?


1 Answer
  1. Definitely you do not need partitioning for such small tables. It is better if each table is a single file: not partitioned, not bucketed.
  2. Use these settings for join optimization (increase the figures if necessary). Check the EXPLAIN plan: it should contain a map join operator, and small tables can be joined on the same mapper (see the EXPLAIN example after this list).
    set hive.auto.convert.join=true;
    set hive.mapjoin.smalltable.filesize=157286400; --if the file size is smaller than this threshold, map join will be used
    
    set hive.auto.convert.join.noconditionaltask = true;
    set hive.auto.convert.join.noconditionaltask.size = 157286400; --combined small tables size
  3. Using TEXTFILE for small tables may be better than ORC, because a plain TEXTFILE can be smaller in size for such small tables. The same rule applies to compression: use compression only if it significantly reduces the file size; small files cannot always be compressed efficiently (a compressed small file can even be bigger than the uncompressed one). Use ORC for bigger dimensions. Check the file size and decide (see the TEXTFILE DDL sketch after this list).
    Bear in mind that the fastest SerDe is LazySimpleSerDe, so the default tab-delimited TEXTFILE is good for small files. For bigger files use ORC and compression.

  4. External or managed does not matter in this context.
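
For illustration, a minimal sketch of such a small lookup table as a plain tab-delimited TEXTFILE (the table and column names mirror fromto1 from the question; the sample rows are made up):

    -- Single small file: no partitioning, no bucketing, no compression.
    -- ROW FORMAT DELIMITED uses the default LazySimpleSerDe.
    CREATE TABLE mydb.fromto1 (
      id1   bigint,
      name1 string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load the handful of lookup rows (Hive 0.14+ supports INSERT ... VALUES).
    INSERT INTO mydb.fromto1 VALUES (1, 'name A'), (2, 'name B');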
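
And a hedged example of how to check that the join is actually converted to a map join: apply the settings above, then run EXPLAIN on the view's query and look for a Map Join Operator in the plan (a common/shuffle join operator instead means the thresholds need to be increased):

    set hive.auto.convert.join=true;
    set hive.auto.convert.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size=157286400;

    -- The small table should be broadcast to the mappers as a hash table,
    -- so the plan shows a Map Join Operator instead of a shuffle join.
    EXPLAIN
    SELECT big.*, tiny.name1
    FROM mydb.big_fact_table big
    JOIN mydb.fromto1 tiny ON big.id1 = tiny.id1;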

leftjoin