I have a DataFrame with 18 rows and more than 20 columns. For example:

My list: ('N','N')

-----------
A   B   C
-----------
a   b   c
d   e   f

I also have a list with 18 values. I want to add this list to my DataFrame as a new column, so that each value in the list corresponds to one row.

That means the final result should be like this:

--------------
A   B   C   D
--------------
a   b   c   N
d   e   f   N

Here is what I tried (from this link):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StringType

// C is a list of values
val rdd = sc.parallelize(C)

// joindf is my DataFrame, which has 20+ columns
val rdd_new = joindf.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
sqlContext.createDataFrame(rdd_new, joindf.schema.add("CD", StringType)).show

This gives me the following error: Can't zip RDDs with unequal numbers of partitions: List(200,2)

Any help would be appreciated!

UPDATE

I'm not sure why the partitioning or the zip doesn't work, but the comments suggested another way to do this. I just copied the method from this link.
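The linked method is not reproduced in this thread, so the following is only a sketch of one common approach, not necessarily the one from the link: index both the DataFrame rows and the list with `zipWithIndex`, then join on that index instead of calling `zip`. The helper name `addColumnByIndex` and its parameters are hypothetical, and the code assumes the list holds strings and that the DataFrame's current row order is the order you want to match against.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not from the thread): appends `values` as a new
// string column, matching list position to row position.
def addColumnByIndex(df: DataFrame, values: Seq[String], colName: String): DataFrame = {
  val sc = df.sqlContext.sparkContext

  // Key every DataFrame row by its position: (index, row)
  val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }

  // Key every list value by its position: (index, value)
  val indexedValues = sc.parallelize(values).zipWithIndex.map { case (v, idx) => (idx, v) }

  // Join on the index, restore the original order, and append the value to each row
  val joined = indexedRows.join(indexedValues)
    .sortByKey()
    .values
    .map { case (row, v) => Row.fromSeq(row.toSeq :+ v) }

  df.sqlContext.createDataFrame(joined, df.schema.add(colName, StringType))
}
```

With the names from the question this would be called as `addColumnByIndex(joindf, C, "CD").show`. Unlike `zip`, the join does not require the two RDDs to have matching partitions. It is an inner join, though, so any row without a list value at the same position is dropped, and the join adds a shuffle.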

Anna
  • I think you should take a look at `typedLit`. Available with Spark 2.1+ , I guess. – philantrovert Jul 26 '17 at 13:56
  • It's available with 2.2+ I am using 1.6 – Anna Jul 26 '17 at 14:00
  • Have you tried to set the partitions to be the same? `val rdd = sc.parallelize(C, joindf.rdd.partitions.size)`? – Psidom Jul 26 '17 at 15:01
  • @Psidom I just tried, and it's giving me a wave of errors, saying can only zip RDDs with same number of elements. But I checked! They both have a count of 18... Is it because my joindf has over 20 columns and C is just a list with 18 elements? – Anna Jul 26 '17 at 15:05
  • The first check should be whether the data frame has 18 rows as well. Then you might want to use [this method](https://stackoverflow.com/questions/28687149/how-to-get-the-number-of-elements-in-partition) to see whether each partition contains the same number of elements. – Psidom Jul 26 '17 at 15:10
  • @Psidom Just checked, joindf.count is 18, also tried the methods in the link, I did a collect on both rdd and they are the same if I am not blind – Anna Jul 26 '17 at 15:20
  • Take a look at @vdep's link, which is a more general answer, when the partitions of the data frame and rdd are not the same. – Psidom Jul 26 '17 at 15:23
  • @vdep Thank you it worked... – Anna Jul 26 '17 at 15:30
  • @Psidom it worked, thank god. I do have a question tho, is it okay to use this method for large datasets? – Anna Jul 26 '17 at 15:31
  • I think that's the only way to add a column from an RDD at the moment, so you'll have to try it out. Consider [repartitioning](https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4) your data frame and RDD, though. – Psidom Jul 26 '17 at 15:34
  • Thanks I'll take a look at that. – Anna Jul 26 '17 at 15:45
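
On the "same number of elements" error discussed in the comments: `zip` requires both RDDs to have the same number of partitions and the same number of elements in each partition, so two RDDs with 18 elements in total can still fail if those elements are spread differently across partitions. A minimal sketch of the per-partition check suggested above follows; the helper name `partitionSizes` is hypothetical.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns (partitionIndex, elementCount) for every partition.
def partitionSizes[T](rdd: RDD[T]): Array[(Int, Int)] =
  rdd.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()
```

Comparing `partitionSizes(joindf.rdd)` with `partitionSizes(rdd)` shows whether the two sides line up partition by partition. If the arrays differ anywhere, `zip` will fail even though both totals are 18.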
