
Is there a way to append one dataframe horizontally to another, assuming both have the same number of rows?

This would be the equivalent of pandas `concat` with `axis=1`:

result = pd.concat([df1, df4], axis=1) 

or R's `cbind`.

WestCoastProjects

1 Answer


There isn't one. Unlike a Pandas DataFrame, a Spark DataFrame is closer to a relation and has no inherent row order.

There is a known pattern in which you convert the data to an RDD, call `zipWithIndex` (see "PySpark DataFrames - way to enumerate without converting to Pandas?"), and then join on the index field, but it is ultimately an antipattern*.


* If the order is not explicitly guaranteed (and who knows what happens under the hood with all the new bells and whistles, such as the cost-based optimizer and custom optimizer rules), this can easily become brittle and silently fail in some unexpected way.

  • Your comments (/answer) here ring true. I'll wait to see what other ideas come in. The `zipWithIndex` will actually likely be the approach in the end. ooc why do you call it an `antipattern`? – WestCoastProjects Apr 10 '18 at 21:40
  • If we don't explicitly guarantee specific order (and who knows what happens under the hood with all new bells and whistles like cost-based optimizer and custom optimizer rules) then it can easily become brittle and silently fail in some unexpected way. –  Apr 10 '18 at 21:42
  • hmm, i'll think about that wrt `zipWithIndex`: in any case it is always possible to provide an explicit ordering preceding that operation. – WestCoastProjects Apr 10 '18 at 21:46
  • hey user9627366 where did you go to? why did you disappear? (i had upvoted you and you were in pole position for getting accepted answer..) – WestCoastProjects Apr 10 '18 at 22:54