How do I set the Parquet file size? I've tried tweaking some settings, but I still end up with a single large Parquet file.
I've created a partitioned external table and insert into it with an INSERT OVERWRITE statement.
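For context, the target table is an external Parquet table partitioned by dt, roughly like this (the column names, types, and location below are simplified placeholders, not my exact DDL):
CREATE EXTERNAL TABLE my_table (
  x STRING,
  y BIGINT
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/my_table';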
SET hive.auto.convert.join=false;
SET hive.support.concurrency=false;
SET hive.exec.reducers.max=600;
SET hive.exec.parallel=true;
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET mapreduce.map.output.compress=false;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
SET hive.groupby.orderby.position.alias=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.resultset.use.unique.column.names=false;
SET mapred.reduce.tasks=100;
SET dfs.blocksize=268435456;
SET parquet.block.size=268435456;
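-- 268435456 bytes = 256 MB for both block-size settings above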
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT x, sum(y), dt FROM managed_table GROUP BY dt, x;
Using the dfs.blocksize and parquet.block.size parameters, I was hoping to generate 256 MB Parquet file splits, but instead I get a single 4 GB Parquet file.
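For reference, listing one of the output partition directories from the Hive CLI (the warehouse path is a placeholder for my table's actual location) shows a single ~4 GB .parquet file instead of several ~256 MB files:
dfs -du -h /user/hive/warehouse/my_table/dt=<partition_value>;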