Are there efficient ways to process data column-wise (vs. row-wise) in Spark?
I'd like to do some whole-database, column-wise analysis: iterate through each column in a database and compare it to another column with a significance test. In pseudocode:
colA = "select id, colA from table1"
foreach table t:
    foreach colB in t:              # "select id, colB from table2"
        # align colA and colB by id
        ab = join(colA, colB)
        yield comparefunc(ab)
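Concretely, what I'm doing now looks roughly like the PySpark sketch below (table1/table2, colA, and compare_func are placeholders for my real tables and significance test):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

col_a = spark.sql("SELECT id, colA FROM table1")

def compare_all_columns():
    # one select + join per column: this loop is the slow part
    for col_b in [c for c in spark.table("table2").columns if c != "id"]:
        b = spark.sql("SELECT id, {} FROM table2".format(col_b))
        ab = col_a.join(b, on="id")  # align colA and colB by id
        yield compare_func(ab)       # compare_func = my significance test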
I have ~1M rows but ~10k columns, and issuing ~10k separate selects is very slow. Shouldn't I be able to do a single select * and then send each column to a different node for processing?
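Something like the sketch below is what I have in mind (a sketch only, with assumptions: table2 has one id column plus the ~10k value columns, the reference column colA fits in executor memory so it can be broadcast, and compare_func accepts a list of aligned (a, b) value pairs):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("table2")  # one scan instead of ~10k selects
value_cols = [c for c in df.columns if c != "id"]

# explode each row into (column_name, (id, value)) pairs, then group by
# column name so each column's ~1M values land together on one executor
per_column = df.rdd.flatMap(
    lambda row: [(c, (row["id"], row[c])) for c in value_cols]
).groupByKey()

# broadcast the reference column so every executor can align against it
col_a = dict(spark.sql("SELECT id, colA FROM table1")
             .rdd.map(lambda r: (r["id"], r["colA"])).collect())
col_a_bc = spark.sparkContext.broadcast(col_a)

def compare_one(col_name, id_value_pairs):
    a = col_a_bc.value
    aligned = [(a[i], v) for i, v in id_value_pairs if i in a]  # align by id
    return col_name, compare_func(aligned)  # compare_func = my significance test

results = per_column.map(lambda kv: compare_one(*kv)).collect()

The flatMap shuffles ~1M x 10k (id, value) pairs, though, so I'm not sure whether this is actually the efficient way to do it in Spark, hence the question.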