
Say, for example, I have a DataFrame in the following format (in reality there are a lot more documents):

df.show()

//output
    +-----+-----+-----+
    |doc_0|doc_1|doc_2|
    +-----+-----+-----+
    |  0.0|  1.0|  0.0|
    |  0.0|  1.0|  0.0|
    |  2.0|  0.0|  1.0|
    +-----+-----+-----+

// ngramShingles is a list of shingles
println(ngramShingles)

//output
    List("the", "he ", "e l")

Where the length of ngramShingles is equal to the number of rows in the DataFrame.

How would I get to the following output?

// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "the"|
|  0.0|  1.0|  0.0|  "he "|
|  2.0|  0.0|  1.0|  "e l"|
+-----+-----+-----+-------+

I have tried to add a column via the following line of code:

val finalDf = df.withColumn("shingle", typedLit(ngramShingles))

But that gives me this output:

+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2|                shingle|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
|  2.0|  0.0|  1.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+

I have tried a few other solutions, but nothing comes close. Basically, I just want the new column added to each row of the DataFrame.

This question shows how to do this, but both answers rely on there already being a single existing column to work from. I don't think I can apply those answers to my situation, where I have thousands of columns.

Cold Fish
  • Does this answer your question? [Apache Spark how to append new column from list/array to Spark dataframe](https://stackoverflow.com/questions/44395873/apache-spark-how-to-append-new-column-from-list-array-to-spark-dataframe) – ernest_k Jul 29 '21 at 05:08
  • The issue with that answer is that both the answers there (I think) explicitly work only with 1 existing column. In my DataFrame I have a bunch (thousands) of columns, so it's not feasible to do it in the same way. – Cold Fish Jul 29 '21 at 05:09
  • @Yvihs, the accepted answer from the said link actually works for multiple existing columns, as `r._1` represents a `Row` of all columns of the original DataFrame. However, the suggested `zip` in the solution might fail due to `unequal numbers of partitions` -- which could be remedied by a `join` of suitably indexed RDDs. – Leo C Jul 29 '21 at 17:13

1 Answer


You could make a DataFrame from your list and then join the two DataFrames together. To do the join you'd need to add an additional column to use as the join key (it can be dropped afterwards):

import spark.implicits._
import org.apache.spark.sql.functions.monotonically_increasing_id

val listDf = List("the", "he ", "e l").toDF("shingle")

val result = df.withColumn("rn", monotonically_increasing_id())
  .join(listDf.withColumn("rn", monotonically_increasing_id()), "rn")
  .drop("rn")

Result:

+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|    the|
|  0.0|  1.0|  0.0|    he |
|  2.0|  0.0|  1.0|    e l|
+-----+-----+-----+-------+
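
One caveat, as Leo C's comment hints: `monotonically_increasing_id` only guarantees unique, increasing ids *within* each DataFrame, not matching ids *across* two DataFrames, so the join above can mis-align rows once either side is split over more than one partition. A sketch of the indexed-RDD remedy, assuming Spark 2.x+ (`withRowIndex` is a hypothetical helper name, not a Spark API):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("shingles").getOrCreate()
import spark.implicits._

val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0))
  .toDF("doc_0", "doc_1", "doc_2")
val listDf = List("the", "he ", "e l").toDF("shingle")

// Pair each Row with a stable per-row index via the RDD API.
// Unlike monotonically_increasing_id, zipWithIndex yields 0, 1, 2, ...
// in row order regardless of partitioning, so the two indexes line up.
def withRowIndex(d: DataFrame): DataFrame = {
  val indexed = d.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  d.sparkSession.createDataFrame(
    indexed,
    StructType(d.schema.fields :+ StructField("rn", LongType, nullable = false))
  )
}

val result = withRowIndex(df)
  .join(withRowIndex(listDf), "rn")
  .drop("rn")
```

This trades the cheap id column for a pass through the RDD API, but the row pairing is deterministic even on multi-partition inputs.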
Krzysztof Atłasik