2

If I have a large number of small files that need to be stored in Hive tables, which file format is the better way to store them, and why?

sravanthi

2 Answers

0

You can set mapred.job.reuse.jvm.num.tasks to improve performance. The link below is useful: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

See also: reuse JVM in Hadoop MapReduce jobs
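
As a minimal sketch (assuming the classic MRv1 runtime; this property is ignored under YARN, where containers are not reused this way), you could enable JVM reuse per session from Hive:

-- Reuse each JVM for up to 10 tasks (-1 means unlimited reuse); MRv1 only
SET mapred.job.reuse.jvm.num.tasks=10;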

0

Using inefficient file formats, for example TextFile, and storing data without compression compounds the small-file problem, hurting performance and scalability in several ways. If, for example, a Hive table is backed by many very small files in HDFS, reading it is not optimal: roughly one mapper is created per file, so it is better to merge them into fewer, larger files.
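
As a hedged sketch, Hive can also merge small output files at write time. The property names below are standard Hive settings; the size thresholds shown are only illustrative:

-- Merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- If the average output file is below ~16 MB, launch an extra merge step
SET hive.merge.smallfiles.avgsize=16000000;
-- Aim for ~256 MB per merged file
SET hive.merge.size.per.task=256000000;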

Use Hive Concatenate Functionality:

This approach is helpful when data is stored in HDFS and Hive tables are built over it. Apache Hive provides a command to merge the small files inside a partition into larger ones. Here is what that command looks like:

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

This works only if the data files are stored in RCFile or ORC format.
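
For illustration (the table and partition names here are hypothetical), concatenating one partition of an ORC table would look like:

ALTER TABLE web_logs PARTITION (dt = '2019-01-01') CONCATENATE;

For ORC tables the merge happens at the stripe level, and for RCFile at the block level, so the data does not have to be decompressed and decoded row by row.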

kunal218