How do I generate a random number _only once_ for a Hive query?

Question

I have code that looks like this:

set hivevar:foo=rand();
select ${hivevar:foo} from my_database.my_table;

I think, even with the variable substitution, rand() is still being called once per row. How do I just set foo to a random number once per query?

I can think of other solutions, but I'd like to keep everything contained to one hive query instead of patching together something uglier.

score 2 · Answer 1 · answered Sep 27 '18 at 16:22

This code calls rand() for every row:

INSERT INTO dest_table
SELECT st.col1, st.col2, RAND() AS col3
FROM source_table st;

This code calls rand() only once:

WITH rand AS (
  SELECT RAND() as col3
)
INSERT INTO dest_table
SELECT st.col1, st.col2, rand.col3
FROM rand, source_table st;

score 1 · Accepted Answer · answered Sep 27 '18 at 16:25

1

Call it in a one-row subquery and cross join with your query. The subquery should be executed once

select s.rnd, a.col, etc
  from my_database.my_table a
      cross join
  (select rand() as rnd) s
;

answered Sep 27 '18 at 16:25

leftjoin

36,950
8
57
116

Awesome. This helped me understand what was going on and what I actually wanted to do: Take the cartesian product between a scalar and a vector, like what cross join does. Thanks! However, I think I'm going to use something similar to what @VMykyt has, because it seems like cross join takes a little longer to run. – matthiasdenu Sep 27 '18 at 16:41
1

@matthiasdenu It is the same cross join because `FROM rand, source_table` without `ON` condition is a cross join. You better to use ANSI join syntax – leftjoin Sep 27 '18 at 16:45
@matthiasdenu See also this for better understanding: https://stackoverflow.com/a/46843832/2700344 – leftjoin Sep 27 '18 at 16:46
@matthiasdenu Normally it should run map-join, if not, check this:https://stackoverflow.com/a/49154414/2700344 – leftjoin Sep 27 '18 at 16:51

How do I generate a random number _only once_ for a Hive query?

2 Answers2