1

I'm new to Hive and trying to optimize a query that is taking a while to run. I have identical calls to regexp_extract and get_json in my SELECT and WHERE statements, and I was wondering if there is a way to optimize this by storing the results from one statement and using them in the other (or if Hive is already doing something like this in the background).

Example query:

SELECT
    regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
FROM
   api_request_logs
WHERE
    LENGTH(regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)')) > 0

Thanks!

cardinality
  • 13
  • 1
  • 3
  • Did u consider using RegexSerde? I believe it can help - for example create a table with your desired regex and 'insert as select' into it. see http://stackoverflow.com/questions/9102184/regex-for-access-log-in-hive-serde – belostoky Jan 26 '17 at 10:13

1 Answers1

0

You can use a derived table to specify the regex only once but I don't think it runs faster

select * from (
    select regexp_extract(get_json(json, 'url'), '.*[&?]q=([^&]*)') as query
    from api_request_logs
) t where length(query) > 0
FuzzyTree
  • 32,014
  • 3
  • 54
  • 85