
Hello, I'm fairly new to Spark and its data collections. I'm running Spark's TF-IDF example code, and at this point my results are stored in a DataFrame as follows:

>>> rescaledData.show()
+-----+--------------------+--------------------+--------------------+--------------------+
|label|            sentence|               words|         rawFeatures|            features|
+-----+--------------------+--------------------+--------------------+--------------------+
|    0|Hi I heard about ...|[hi, i, heard, ab...|(20,[0,5,9,17],[1...|(20,[0,5,9,17],[0...|
|    0|I wish Java could...|[i, wish, java, c...|(20,[2,7,9,13,15]...|(20,[2,7,9,13,15]...|
|    1|Logistic regressi...|[logistic, regres...|(20,[4,6,13,15,18...|(20,[4,6,13,15,18...|
+-----+--------------------+--------------------+--------------------+--------------------+

>>> rescaledData.select("features").rdd.collect()
[Row(features=SparseVector(20, {0: 0.6931, 5: 0.6931, 9: 0.2877, 17: 1.3863})), Row(features=SparseVector(20, {2: 0.6931, 7: 0.6931, 9: 0.863, 13: 0.2877, 15: 0.2877})), Row(features=SparseVector(20, {4: 0.6931, 6: 0.6931, 13: 0.2877, 15: 0.2877, 18: 0.6931}))]
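For reference, rescaledData above was produced by (roughly) the pipeline from Spark's TF-IDF example, i.e. Tokenizer -> HashingTF -> IDF. This is a reconstruction rather than my exact script; the column names and numFeatures=20 match the output shown, and spark is an existing SparkSession:

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# toy data set from the Spark docs; labels/sentences match the show() output above
sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

# split each sentence into lower-cased tokens
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# hash each token into one of 20 buckets and count term frequencies
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# rescale the raw term frequencies by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)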

Is it possible to find the 'most important' word (the one with the highest TF-IDF value) in each of the sentences in my data set? For example, in my second sentence the token with the highest value (0.863) is token no. 9 -> 'java'. How can I compute this?
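One approach I can think of (an untested sketch, reusing the hashingTF object from above; it assumes Spark >= 3.0, where pyspark.ml.feature.HashingTF has an indexOf method): take the argmax over each row's SparseVector, then re-hash that row's own tokens to see which word(s) land in the winning bucket, since the hashing itself is one-way:

# collect() is fine here because the toy data set is only three rows
rows = rescaledData.select("sentence", "words", "features").collect()

for row in rows:
    v = row["features"]  # SparseVector of TF-IDF weights
    # feature index with the highest TF-IDF weight in this sentence
    idx, weight = max(zip(v.indices, v.values), key=lambda pair: pair[1])
    # HashingTF can't be inverted, so find the word(s) in this sentence
    # whose hash bucket equals the winning index
    top_words = [w for w in row["words"] if hashingTF.indexOf(w) == int(idx)]
    print(row["sentence"], "->", top_words, float(weight))

With only 20 buckets, collisions are possible, so top_words may hold more than one token. If an exact index-to-word mapping matters, would swapping HashingTF for CountVectorizer be the cleaner route? CountVectorizerModel.vocabulary is ordered by feature index, so vocabulary[idx] recovers the word directly.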
