I want to execute a very large number of Hive queries and store the results in a dataframe.
I have a very large dataset structured like this:
+-------------------+-------------------+---------+--------+--------+
| visid_high| visid_low|visit_num|genderid|count(1)|
+-------------------+-------------------+---------+--------+--------+
|3666627339384069624| 693073552020244687| 24| 2| 14|
|1104606287317036885|3578924774645377283| 2| 2| 8|
|3102893676414472155|4502736478394082631| 1| 2| 11|
| 811298620687176957|4311066360872821354| 17| 2| 6|
|5221837665223655432| 474971729978862555| 38| 2| 4|
+-------------------+-------------------+---------+--------+--------+
I want to create a derived dataframe which uses each row as input for a secondary query:
result_set = []
for session in sessions.collect()[:100]:
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} GROUP BY prop8".format(date, session['visid_high'], session['visid_low'], session['visit_num'])
    result = hc.sql(query).collect()
    result_set.append(result)
This works as expected for a hundred rows, but causes Livy to time out at higher loads.
I tried using map or foreach:
def f(session):
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} GROUP BY prop8".format(date, session.visid_high, session.visid_low, session.visit_num)
    return hc.sql(query)

test = sampleRdd.map(f)
but this raises PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable. I understand from this answer and this answer that the Spark context object is not serializable.
I didn't try generating all queries first and then running them as a batch, because I understand from this question that batch querying is not supported.
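For reference, one alternative I have been considering (but have not tested at this scale) is to rewrite the per-row lookups as a single join. A rough sketch, assuming sessions is the dataframe shown above and date holds the same dt value used in the loop:

# Untested sketch: replace the per-row queries with one join against hit_data.
sessions.registerTempTable("sessions")  # createOrReplaceTempView on newer Spark versions

joined = hc.sql("""
    SELECT s.visid_high, s.visid_low, s.visit_num, h.prop8, count(1)
    FROM hit_data h
    JOIN sessions s
      ON h.visid_high = s.visid_high
     AND h.visid_low = s.visid_low
     AND h.visit_num = s.visit_num
    WHERE h.dt = {0}
    GROUP BY s.visid_high, s.visid_low, s.visit_num, h.prop8
""".format(date))

I am not sure whether that is the right direction either.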
How do I proceed?