
I am working on a problem where I have a large number of small compressed text files. Each file is approximately 10-20 KB, and in total there are TBs of data. I need to load these files into Hive. Later, Tableau will use the Hive tables for its report generation. I am using AWS.

What is the best way to load this data into Hive? My plan is:

  1. Move the compressed data into the mappers.
  2. Decompress the files using a map-only job.
  3. Process the resulting text files.
  4. Create a Hive table.
  5. Load the data from the mappers into the Hive table. (My concern lies with this step. As per my understanding, data can be loaded into Hive tables using multiple mappers, but I am not sure; see the sketch after this list.)
  6. Use the Hive tables in the reporting tool.
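
For reference, here is roughly what I am picturing for steps 4-5, assuming Hive can read the gzipped text files directly from S3 (a minimal sketch; the bucket, path, delimiter, and column names are placeholders, not my real schema):

```sql
-- Sketch: external table defined over the existing gzipped text files in S3.
-- Hive decompresses .gz text transparently at read time, so the separate
-- decompression map-only job may not even be necessary.
CREATE EXTERNAL TABLE raw_events (
  event_ts STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/raw/';

-- Quick sanity check that the files are readable.
SELECT * FROM raw_events LIMIT 10;
```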

Please suggest whether there is a better way to handle this scenario.

Thanks

Ajay
  • If you could merge the small text files, I suggest merging them to a size above 100 MB, which would be more suitable for `map-reduce` work and for later Hive queries as well. – luoluo Sep 22 '15 at 08:07
  • I can't merge these text files; I have to use them as is. What's your opinion about my approach? Is it doable? – Ajay Sep 22 '15 at 09:50
  • You don't want to add all of these files to HDFS. By my calculation you'd be adding tens of millions of files to HDFS, which is not a good idea. As luoluo suggests, you need to merge these as soon as possible and remove the small files. I'd advise doing this before you add the files to HDFS. See some answers at http://stackoverflow.com/questions/3548259/merging-multiple-files-into-one-within-hadoop. – Ben Watson Nov 12 '15 at 08:55
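
If merging the source files themselves is not an option, a consolidation pass inside Hive after the initial load might achieve the same effect as the merging suggested in the comments above. A minimal sketch, assuming the external table from earlier and ORC as the target format (the table name and threshold value are placeholders):

```sql
-- Sketch: rewrite the many small input files into fewer, larger ones inside Hive.
CREATE TABLE events_consolidated (
  event_ts STRING,
  payload  STRING
)
STORED AS ORC;

-- Ask Hive to merge small output files produced by the insert.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;   -- ~128 MB average target

INSERT OVERWRITE TABLE events_consolidated
SELECT event_ts, payload
FROM raw_events;
```

Tableau could then query `events_consolidated` instead of the raw external table.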

0 Answers