which is the most optimal way to filter out string column in Hive?

Question

I am filtering on a string column called mycolumn. I can think of the following 3 ways to filter, which will be achieve better performance ?

-- method #1
where
(   mycolumn = 'FixedStringA.FixedStringB.VariableStringA.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringB.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringC.FixedString' 
OR mycolumn = 'FixedStringA.FixedStringB.VariableStringD.FixedString' );

-- method #2
where mycolumn like '%//.FixedString';

-- method #3
where split(mycolumn,'//.')[3] = 'FixedString';

Please know that FixedStringA and FixedStringB are like constants , their values will remain fixed, that's why calling them fixed strings. And mycolumn is not a partition key for your info.

I suggest method #1 if your filter is static all the time, which save time/performance for you. Or if you have situation where you need to change your filter condition it is good to go with method #2 & method #3 which are a kind of regex type. — sangam.gavini, Oct 04 '19 at 08:12

leftjoin · Answer 1 · 2019-10-04T09:13:26.020

In the method 3 it should be \\. not //. See this answer: Dot in regexp

Also in the method 2 it is probably should be like '%.FixedString'

Also you can use rlike:

RLIKE '\\.FixedString$'

You will not notice the difference in Hive because it is full-scan anyway and it is running in parallel.

I'd prefer RLIKE '\\.FixedString$' because it is short and powerful regexp-like.

splitting and producing array (method 3) will create more pressure on memory and garbage collector, you will create and free much more objects.

method 1 is also good, but looks not so elegant. Method 1 can be better if you have internal indexes like in ORC file, you need to check it yourself.

which is the most optimal way to filter out string column in Hive?

1 Answers1