4

Currently hive does support non equi join. But as the cross product becomes pretty huge, I was wondering what are the options to tackle a large fact(257 billion rows, 37 tb) and relatively smaller(8.7 gb) dimension table join.

In case of equi join I can make it work easily with proper bucketing on the join column/columns . (using same number of buckets for SMBM practically converting to a map join). But if we think this wont be of any advantage when its a non equi join, because the values will be there in other buckets, practically triggering a shuffle i.e. a reduce phase.

If any one has any thoughts to overcome this, please suggest .....

user3123372
  • 704
  • 1
  • 10
  • 26

1 Answers1

-1

If the dimension table fits in memory, you can create a Custom User Defined Function (UDF) as stated here, and perform the inequi-join in memory.

Javier
  • 74
  • 7
  • the size is close to 10 gb , so its understood that it wont fit in memory – user3123372 Jun 19 '19 at 12:49
  • I am sorry. I am working with 256Gb nodes where 10Gb dimension table would easily fit in memory. That's why I remarked "It the dimension table fits in memory" – Javier Aug 13 '19 at 06:54
  • our nodes are having 400 gb. it has to fit in the memory of one process i.e. map, reduce. not in one node – user3123372 Aug 16 '19 at 11:42
  • You can set the desired amount of YARN container memory until the maximun allowed per node just before executing the query. On our 256 Gb nodes, the default size is only 7G, but we can request as much as 212Gb. Remenber that the UDF will be instanciated on several containers and it will take several minutes to load the data in memory, but it would worth the effort. – Javier Aug 19 '19 at 06:57
  • But then you loose the parallelism . – user3123372 Aug 22 '19 at 09:46
  • You get as much parallelism as the number of concurrent containters. The worst case would be just one container per node using all the available memory. That would be as parallelism as nodes in the cluster. – Javier Aug 23 '19 at 10:06
  • Anyway, you are asking for any thoughts to overcome the fact that hive does support non equi join. Here you have one alternative with some disadvantages:It will take time to load the data on memory each time you execute the query, it will use a significant amount of memory and you won't get the maximun parallelism. It has only one advantage depending on the amount of data: It will run faster than a cross join. – Javier Aug 23 '19 at 10:21