
Problem Statement:

I need to compare two tables, Table1 and Table2, which both store the same kind of data. Table1 is the main table, so Table2 needs to be compared against it. After the comparison I need to produce a report of the discrepancies found in Table2. These two tables hold a lot of data, on the order of terabytes. Currently I have written HiveQL to do the comparison and get the results back.

So my question is: which is better in terms of PERFORMANCE, writing a CUSTOM MAPPER and REDUCER for this kind of job, or will the HiveQL I wrote be fine, given that I will be joining these two tables on millions of records? As far as I know, HiveQL internally (behind the scenes) generates an optimized map-reduce job, submits it for execution, and returns the results.
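For reference, the comparison described above can be expressed as a single full outer join in HiveQL. This is only a sketch: the key column `id` and payload columns `c1`, `c2` are hypothetical stand-ins for the real schema.

```sql
-- Hypothetical schema: both tables keyed by `id` with payload columns c1, c2.
-- A FULL OUTER JOIN surfaces rows missing from either side as well as rows
-- whose payloads differ, treating Table1 as the reference table.
-- Note: `<>` evaluates to NULL (not true) when either side is NULL, so
-- NULL-valued columns would need extra handling in a real query.
SELECT
  COALESCE(t1.id, t2.id) AS id,
  CASE
    WHEN t2.id IS NULL THEN 'missing_in_table2'
    WHEN t1.id IS NULL THEN 'extra_in_table2'
    ELSE 'value_mismatch'
  END AS discrepancy
FROM Table1 t1
FULL OUTER JOIN Table2 t2 ON (t1.id = t2.id)
WHERE t2.id IS NULL
   OR t1.id IS NULL
   OR t1.c1 <> t2.c1
   OR t1.c2 <> t2.c2;
```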

arsenal
  • This seems pretty straightforward for a Hive query. I would suggest doing that first. – Olaf Jul 10 '12 at 17:13
  • Yup, I will be doing that, but in general which will be better in terms of performance? A few of my team mates were saying a custom Mapper and Reducer will be slightly faster than HiveQL. – arsenal Jul 10 '12 at 17:25
  • In general, we found that Hive generates pretty decent Map/Reduce jobs. It is even sometimes possible to hint to Hive precisely how you want it to generate those jobs. I would imagine that a custom-written MR job can under certain conditions outperform Hive, but I would not do it without a serious cost/benefit analysis. – Olaf Jul 10 '12 at 17:30
  • ummm, let me study that more. I have one more question that I have posted on SO: [http://stackoverflow.com/questions/11405446/find-10-latest-record-for-each-buyer-id-for-yesterdays-date](http://stackoverflow.com/questions/11405446/find-10-latest-record-for-each-buyer-id-for-yesterdays-date) It would be of great help if you could take a look, as I am new to HiveQL and having a lot of problems. – arsenal Jul 10 '12 at 17:32
  • Well, if you cannot express your request in the Hive dialect of SQL, you don't have an option other than writing a custom MR job, do you? BTW, Hive doesn't have a built-in DATETIME data type, but the way your data is formatted, you should be fine (I assume the date in your sample is July 9th, not September 7th). – Olaf Jul 10 '12 at 17:58
  • Yup, it's July 9th, basically. I am not sure what has to be done in this case. Is there any direct way to achieve this? – arsenal Jul 10 '12 at 18:05

1 Answer


The answer to your question is two-fold.

Firstly, if there is some processing that you can express in Hive QL syntax, I would argue that Hive's performance is comparable to that of writing custom map-reduce. The only catch is when you have some extra information about your data that you make use of in your map-reduce code but that Hive doesn't know about. For example, if your data is sorted, you may make use of that fact when processing your file splits in the mapper, whereas unless Hive is made aware of the sort order, it can't exploit it. Oftentimes there is a way to specify such extra information (through metadata or config properties), but sometimes there may not even be a way to convey that information to Hive.
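As a sketch of passing such information through metadata and config properties (table and column names here are hypothetical), Hive can be told about bucketing and sort order at table-creation time, and the corresponding join optimizations enabled explicitly:

```sql
-- Declare the physical layout so Hive can pick a bucketed sort-merge join
-- instead of a full shuffle join (table/column names are hypothetical).
CREATE TABLE Table1 (id BIGINT, c1 STRING, c2 STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 256 BUCKETS;

-- Ask Hive to exploit that metadata when joining two tables
-- bucketed and sorted on the same key.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```

Both tables would need to be bucketed and sorted on the join key for the sort-merge bucket join to kick in.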

Secondly, sometimes the processing can be convoluted enough to not be easily expressible in a SQL-like statement. These cases typically involve having to store intermediate state during your processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive Transform functionality. It allows you to take advantage of map-reduce within the context of a Hive query, letting you mix and match Hive SQL-like functionality with custom map-reduce scripts, all in the same query.
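A minimal sketch of that Transform approach (the script name and columns are hypothetical): Hive streams the selected rows to the script over stdin, tab-delimited, and reads its stdout back as rows of the query.

```sql
-- Ship a custom script to the cluster and plug it into the query.
ADD FILE my_reducer.py;

SELECT TRANSFORM (id, c1, c2)
  USING 'python my_reducer.py'   -- reads tab-delimited rows on stdin,
  AS (id, discrepancy)           -- emits tab-delimited rows on stdout
FROM (
  SELECT id, c1, c2
  FROM Table2
  DISTRIBUTE BY id   -- shuffle like a map-reduce partition...
  SORT BY id         -- ...so the script sees each key's rows together
) t;
```

The DISTRIBUTE BY / SORT BY subquery reproduces the grouping a reducer would normally receive, so the script can keep per-key state between lines.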

Long story short: if your processing is easily expressible through a Hive QL query, I don't see much reason to write map-reduce code to achieve the same. One of the main reasons Hive was created was to allow people like us to write SQL-like queries instead of writing map-reduce. If we end up writing map-reduce instead of quintessential Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing custom map-reduce implementation that makes use of that information. But, then again, no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using Hive transform functionality as mentioned before.

Mark Grover
  • So basically you are saying, considering my scenario, that I need to compare two tables with millions of records in them, it will be fine working with HiveQL as compared to writing custom Map Reduce jobs for my case? – arsenal Jul 11 '12 at 04:10
  • Yes, you will be fine, if your query is easily expressible in HQL. – Mark Grover Jul 11 '12 at 14:36