1

I am filtering on a string column called mycolumn. I can think of the following 3 ways to filter, which will be achieve better performance ?

-- method #1
where
(   mycolumn = 'FixedStringA.FixedStringB.VariableStringA.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringB.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringC.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringD.FixedString' );

-- method #2
where mycolumn like '%//.FixedString';

-- method #3
where split(mycolumn,'//.')[3] = 'FixedString';

Please know that FixedStringA and FixedStringB are like constants , their values will remain fixed, that's why calling them fixed strings. And mycolumn is not a partition key for your info.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Dhiraj
  • 3,396
  • 4
  • 41
  • 80
  • I suggest method #1 if your filter is static all the time, which save time/performance for you. Or if you have situation where you need to change your filter condition it is good to go with method #2 & method #3 which are a kind of regex type. – sangam.gavini Oct 04 '19 at 08:12

1 Answers1

0

In the method 3 it should be \\. not //. See this answer: Dot in regexp

Also in the method 2 it is probably should be like '%.FixedString'

Also you can use rlike:

RLIKE '\\.FixedString$'

You will not notice the difference in Hive because it is full-scan anyway and it is running in parallel.

I'd prefer RLIKE '\\.FixedString$' because it is short and powerful regexp-like.

splitting and producing array (method 3) will create more pressure on memory and garbage collector, you will create and free much more objects.

method 1 is also good, but looks not so elegant. Method 1 can be better if you have internal indexes like in ORC file, you need to check it yourself.

leftjoin
  • 36,950
  • 8
  • 57
  • 116