Say, for example, I have a DataFrame in the following format (in reality there are many more documents):
df.show()
//output
+-----+-----+-----+
|doc_0|doc_1|doc_2|
+-----+-----+-----+
|  0.0|  1.0|  0.0|
+-----+-----+-----+
|  0.0|  1.0|  0.0|
+-----+-----+-----+
|  2.0|  0.0|  1.0|
+-----+-----+-----+
// ngramShingles is a list of shingles
println(ngramShingles)
//output
List("the", "he ", "e l")
Here, the length of ngramShingles equals the number of rows in the DataFrame (that is, the length of each column).
How would I get to the following output?
// Desired Output
+-----+-----+-----+-------+
|doc_0|doc_1|doc_2|shingle|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "the"|
+-----+-----+-----+-------+
|  0.0|  1.0|  0.0|  "he "|
+-----+-----+-----+-------+
|  2.0|  0.0|  1.0|  "e l"|
+-----+-----+-----+-------+
I have tried to add a column via the following line of code:
import org.apache.spark.sql.functions.typedLit
val finalDf = df.withColumn("shingle", typedLit(ngramShingles))
But that gives me this output:
+-----+-----+-----+-----------------------+
|doc_0|doc_1|doc_2|                shingle|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  0.0|  1.0|  0.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
|  2.0|  0.0|  1.0|  ("the", "he ", "e l")|
+-----+-----+-----+-----------------------+
I have tried a few other solutions, but nothing I have tried comes close. Basically, I want the i-th element of ngramShingles to become the value of the new column in the i-th row of the DataFrame.
This question shows how to do this, but both answers rely on a column already existing. I don't think I can apply those answers to my situation, where I have thousands of columns.
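For what it's worth, one direction that might avoid needing an existing key column is to pair each row with its position and append the list element found at that position. The following is only a minimal sketch of that idea, not a confirmed solution. The SparkSession setup, the sample data, and the names shingles, rowsWithShingle and schema are assumptions made for illustration. It also assumes the DataFrame's row order lines up with the order of ngramShingles, which Spark does not guarantee in general.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("shingle-column").getOrCreate()
import spark.implicits._

// Sample data mirroring the tables in the question.
val df = Seq((0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 1.0)).toDF("doc_0", "doc_1", "doc_2")
val ngramShingles = List("the", "he ", "e l")

// A Vector gives fast indexed access (lookup by index in a List is linear).
val shingles = ngramShingles.toVector

// Pair every row with its position, then append the shingle found at that position.
// Assumption: row order lines up with the list order.
val rowsWithShingle = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ shingles(idx.toInt))
}

// Extend the original schema with the new string column.
val schema = StructType(df.schema.fields :+ StructField("shingle", StringType, nullable = false))

val finalDf = spark.createDataFrame(rowsWithShingle, schema)
finalDf.show()
```

Because each row is extended as a whole, this sketch never has to name any of the existing columns, which seems relevant when there are thousands of them.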