I need to split a 5 GB ORC file into 5 files of 1 GB each. ORC files are splittable; does that mean a file can only be split stripe by stripe? My requirement is to split the ORC file by size, for example a 5 GB file into 5 files of 1 GB each. If this is possible, please share an example.

Sham Desale

1 Answer

A common approach, considering that your file could be 5 GB, 100 GB, 1 TB, 100 TB, and so on, is to create a Hive table pointing at this file, define a second table pointing at a different directory, and then run an insert from the first table into the second using Hive's INSERT statement.

At the beginning of the script, make sure you have the following Hive flags:

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=1073741824;
set hive.merge.size.per.task=1073741824;

With these settings, Hive merges small output files and targets about 1073741824 bytes, which equals 1 GB, per merged file.
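As a quick sanity check on the number used in those flags, here is a minimal Java sketch of the arithmetic. The class and method names (MergeSizeSketch, expectedFileCount) are made up for illustration, and the file count is only an estimate: the real number of output files depends on compression and on how Hive groups files during the merge.

```java
public class MergeSizeSketch {
    // 1 GB expressed in bytes: the value passed to the two hive.merge.* size flags.
    static final long ONE_GB = 1024L * 1024L * 1024L;

    // Rough number of output files for a given target size: ceil(totalBytes / targetBytes).
    // Hypothetical helper, not part of Hive or Hadoop.
    static long expectedFileCount(long totalBytes, long targetBytes) {
        return (totalBytes + targetBytes - 1) / targetBytes;
    }

    public static void main(String[] args) {
        System.out.println(ONE_GB);                                // 1073741824
        System.out.println(expectedFileCount(5 * ONE_GB, ONE_GB)); // 5
    }
}
```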

If you want to use only Java code, play with these flags:

mapred.max.split.size
mapred.min.split.size
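To get a feel for how these two properties interact, here is a small sketch modeled on the rule that, as far as I know, Hadoop's FileInputFormat applies when sizing input splits: max(minSize, min(maxSize, blockSize)). The class name SplitSizeSketch and the 128 MB block size are assumptions for illustration. Note that these properties control input splits (how the file is divided among readers), not the size of the files that get written, and for ORC the actual split boundaries still fall on stripes.

```java
public class SplitSizeSketch {
    // Split size rule: clamp the block size between the configured min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed HDFS block size

        // No limits set: the split size follows the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));         // 134217728

        // Raising the minimum to 1 GB forces larger splits.
        System.out.println(computeSplitSize(blockSize, 1024 * mb, Long.MAX_VALUE));  // 1073741824

        // Lowering the maximum to 64 MB forces smaller splits.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * mb));                // 67108864
    }
}
```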

Please check these, they are very useful:

dbustosp
  • Thanks for your reply. Is there any way to do the splitting using core Java only, not Hive? – Sham Desale Mar 06 '17 at 07:40
  • I need a solution based entirely on the core Java API, not Hadoop or MapReduce. Anyway, thank you very much for taking the time to respond. – Sham Desale Mar 07 '17 at 08:19
  • @ShamDesale Then remove the tags from the question. Remove hadoop, apache-crunch and apache, given that the question has nothing to do with Hadoop. – dbustosp Mar 07 '17 at 08:25
  • Let me rephrase my question: I am reading an ORC file in Java and then splitting it by size. For instance, if the file is 5 GB, I need to create 5 files of 1 GB each. I am able to do this in Java. The only problem is that the stripe size of the split files differs from that of the original file. I want the split files to use the original file's stripe size. How can I retrieve the stripe size of a file using the ORC reader in Java? Please reply – Sham Desale Mar 14 '17 at 14:14