0

I have a huge dataframe containing millions of rows. From these rows I derive new k dataframes which have only 1 row and 1 column. What's a good way to concatenate these k dataframes together so as to now get a a dataframe 1 x k that has 1 row and k columns.

  1. In the past I started with using a crossJoin among all the k dataframes, such as df1.crossJoin(df2).crossJoin(df3).crossJoin(dfk)

    This resulted in a broadcast timeout error,

  2. Later I moved to what I thought is a smarter solutions.

    df1.withColumn("temp_id", lit(0)).join(df2.withColumn("temp_id", lit(0)), "temp_id").drop("temp_id").

    This resulted in a weirder yet similar error of broadcast timeout.

The result that I really want is a new DataFrame with 1 row and k columns which in numpy/pandas language could be

pandas.concat(..., axis=1) OR np.vstack(.....)

praateek
  • 11
  • 1
  • pd.concat would be my guess, but without any data it is hard to determine the best way. – Edeki Okoh May 03 '19 at 23:58
  • Keeping in mind that the main difference of pandas and spark is that the last one is a distributed system the errors you mentioned might be a result of a big shuffle. Please provide a full description as shown here: https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples – abiratsis May 04 '19 at 09:39

1 Answers1

0

The operation I think you want to perform is a "zip" operation. Spark does not provide this method for Dataframes, but you can see how it works with the following example (A Spark version follows this example):

scala> val l1 = List("a", "b")
l1: List[String] = List(a, b)

scala> val l2 = List(1,2)
l2: List[Int] = List(1, 2)

scala> val zipped = l1.zip(l2)
zipped: List[(String, Int)] = List((a,1), (b,2))

scala> zipped.foreach(println)
(a,1)
(b,2)

scala> 

How to do this in Spark has already been answered here: How to zip two (or more) DataFrame in Spark

Basically, you do this:

val zippedRDD = df1.rdd.zip(df2.rdd)

this will leave you with an RDD, which you can convert to a DF or DS as needed in the usual way.

GMc
  • 1,764
  • 1
  • 8
  • 26
  • @praateek if this has helped you, could you please accept the answer (click on the grey check mark next to the answer) and upvote it? Thanks. – GMc May 13 '19 at 11:05