
Are there efficient ways to process data column-wise (vs row-wise) in Spark?

I'd like to do some whole-database analysis of each column: iterate through every column in the database and compare it against a reference column with a significance test. In pseudocode:

colA = "select id, colA from table1"

foreach table, t:
   foreach id,colB in t: # "select id, colB from table2"
     # align colA,colB by ID
     ab = join(colA,colB)
     yield comparefunc(ab)
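
In PySpark terms the naive version looks roughly like this (a sketch only: other_column_names, compare_func and the table names are placeholders, and I'm assuming both tables are registered with Spark SQL):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Fetch the reference column once.
    col_a = spark.sql("select id, colA from table1")

    results = {}
    for col_name in other_column_names:                   # ~10k column names
        col_b = spark.sql("select id, `%s` from table2" % col_name)
        ab = col_a.join(col_b, on="id")                   # align colA and colB by id
        results[col_name] = compare_func(ab)              # significance test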

I have ~1M rows but ~10k columns. Issuing ~10k selects is very slow. Shouldn't I be able to do a single select * and broadcast each column to a different node for processing?
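
For example, something along these lines, where the wide table is melted into (id, col_name, value) form so each column's values end up grouped together on the executors. This is only a rough sketch: it assumes PySpark, that table1 is registered in the catalog, that every column can be cast to double, and a placeholder compare_func standing in for the real significance test:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.table("table1")                  # one scan instead of ~10k selects
    ref_name = "colA"
    value_cols = [c for c in df.columns if c not in ("id", ref_name)]

    # Melt wide -> long with stack(); the values must share one type, hence the cast to double.
    stack_expr = "stack({n}, {args}) as (col_name, value)".format(
        n=len(value_cols),
        args=", ".join("'{c}', cast(`{c}` as double)".format(c=c) for c in value_cols),
    )
    long_df = df.selectExpr("id", "{} as ref_value".format(ref_name), stack_expr)

    def compare_func(pairs):
        """Placeholder for the real significance test."""
        return float(len(pairs))

    # Group each column's (ref_value, value) pairs together and run the test per column.
    results = (
        long_df.rdd
        .map(lambda r: (r["col_name"], (r["ref_value"], r["value"])))
        .groupByKey()
        .mapValues(lambda pairs: compare_func(list(pairs)))
        .collect()
    )

That's the kind of thing I have in mind, but I don't know whether it's the idiomatic way to do column-wise processing in Spark.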

    Have you considered transposing (see http://stackoverflow.com/questions/37864222/transpose-column-to-row-with-spark) your RDD and then working on rows instead? – Rick Moritz May 16 '17 at 17:16
  • The big downside of K,V representations is that the values' types get collapsed into one (because a Spark column has a single type). It's *possible* to convert everything into doubles (e.g. replacing strings with a label ID), but it ain't pretty. – user48956 Jul 14 '17 at 17:24
  • I've been thinking that converting the data into Parquet files (one per column) may help. Though the implementation seems very broken for PySpark: https://issues.apache.org/jira/browse/SPARK-21392 – user48956 Jul 14 '17 at 17:27

0 Answers