2

If I have a large number of small files that need to be stored in Hive tables, which file format is the better way to store them, and why?

sravanthi

2 Answers

0

You can set mapred.job.reuse.jvm.num.tasks to improve performance. The link below is useful: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/

See also: reuse JVM in Hadoop MapReduce jobs
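
As a minimal sketch (assuming the classic MRv1 runtime; this property is ignored under YARN, where containers are not reused this way), you could enable JVM reuse per session from Hive:

-- Reuse each JVM for up to 10 tasks (-1 means unlimited reuse); MRv1 only
SET mapred.job.reuse.jvm.num.tasks=10;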

0

Using inefficient file formats, for example TextFile, and storing data without compression compounds the small-file problem, hurting performance and scalability in several ways. If, for example, a Hive table is backed by many very small files in HDFS, reading it is not optimal: roughly one mapper is created per file, so it is better to merge them into fewer, larger files.
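
As a hedged sketch, Hive can also merge small output files at write time. The property names below are standard Hive settings; the size thresholds shown are only illustrative:

-- Merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- If the average output file is below ~16 MB, launch an extra merge step
SET hive.merge.smallfiles.avgsize=16000000;
-- Aim for ~256 MB per merged file
SET hive.merge.size.per.task=256000000;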

Use Hive Concatenate Functionality:

This approach is helpful when data is stored in HDFS and Hive tables are built over it. Apache Hive provides a command to merge the small files inside a partition into larger ones. Here is what that command looks like:

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;

This works only if the data files are stored in RCFile or ORC format.
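
For illustration (the table and partition names here are hypothetical), concatenating one partition of an ORC table would look like:

ALTER TABLE web_logs PARTITION (dt = '2019-01-01') CONCATENATE;

For ORC tables the merge happens at the stripe level, and for RCFile at the block level, so the data does not have to be decompressed and decoded row by row.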

kunal218