I am trying to generate all combinations of the unique values within my Spark dataframe. The solution that comes to my mind requires itertools.product and a pandas dataframe, and therefore it is not efficient enough. Here is my code:
import itertools
import pandas as pd

# collect the distinct values of each column to the driver
all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'], data=list(itertools.product(*combined)))
# convert pandas dataframe to spark
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
joined = all_combination_df.join(df,["Date","ID","CATEGORY"],how="left")
Is there any way to change this code to a more Spark-idiomatic one?
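For reference, a minimal setup that the snippet above assumes could look roughly like this (the SparkSession/SQLContext wiring and the example rows are made up; only the column names match my data):

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)  # entry point used by the snippet above

# toy data; only the column names (Date, ID, CATEGORY) match the real dataframe
df = spark.createDataFrame(
    [("2017-01-01", 1, "A"),
     ("2017-01-01", 2, "B"),
     ("2017-01-02", 1, "B")],
    ["Date", "ID", "CATEGORY"],
)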
======EDIT======
I've also tried to implement this functionality using the crossJoin function. Here is the code:
test_df = (df.select('Date').distinct()
           .crossJoin(df.select('ID').distinct())
           .crossJoin(df.select('CATEGORY').distinct()))
test_df.show(10)
which, for some unknown reason, raises the following exception:
An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:832)
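For what it's worth, the expected size of that cross join can be estimated up front from the distinct counts; a small sketch (the n_* variable names are just illustrative, the column names are the same as above):

n_dates = df.select("Date").distinct().count()
n_ids = df.select("ID").distinct().count()
n_categories = df.select("CATEGORY").distinct().count()
# the cross join produces n_dates * n_ids * n_categories rows
print(n_dates, n_ids, n_categories, n_dates * n_ids * n_categories)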