
I am calculating TF and IDF using PySpark's HashingTF and IDF with the following code:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()

# Load documents (one per line).
documents = sc.textFile("random.txt").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)

The question is: I can print tfidf to the screen using the collect() method, but how can I access specific data inside it, or save the whole tfidf vector space to an external file or DataFrame?

asked by K.Ali, edited by zero323

1 Answer


HashingTF and IDF return an RDD where each element is a pyspark.mllib.linalg.Vector (org.apache.spark.mllib.linalg.Vector in Scala). This means that:

  • you can access individual values by index. For example:

    documents = sc.textFile("README.md").map(lambda line: line.split(" "))
    tf = HashingTF().transform(documents)
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    
    v = tfidf.first()
    v
    ## SparseVector(1048576, {261052: 0.0, 362890: 0.0, 816618: 1.9253})
    
    type(v)
    ## pyspark.mllib.linalg.SparseVector
    
    v[0]
    ## 0.0
    
  • it can be saved directly to a text file. Vectors provide a meaningful string representation and a parse method that can be used to restore the original structure:

    from pyspark.mllib.linalg import Vectors
    
    tfidf.saveAsTextFile("/tmp/tfidf")
    sc.textFile("/tmp/tfidf/").map(Vectors.parse)
    
  • it can be placed in a DataFrame:

    df = tfidf.map(lambda v: (v, )).toDF(["features"])
    
    ## df.printSchema()
    ## root
    ## |-- features: vector (nullable = true)
    
    df.show(1, False)
    ## +-------------------------------------------------------------+
    ## |features                                                     |
    ## +-------------------------------------------------------------+
    ## |(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])|
    ## +-------------------------------------------------------------+
    ## only showing top 1 row
    
  • HashingTF is irreversible, so it cannot be used to extract information about specific tokens. See: How to get word details from TF Vector RDD in Spark ML Lib?
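To see why HashingTF is irreversible, consider a minimal pure-Python sketch of the hashing trick it is based on: each term is mapped to an index via a hash modulo the feature count, and distinct terms can collide on one index, so the mapping cannot be inverted to recover tokens. The toy hash below is for illustration only and is not the hash function Spark actually uses:

```python
# Sketch of the hashing trick behind HashingTF (illustration only;
# Spark's real hash function differs).

def hashing_tf(terms, num_features=16):
    """Build a term-frequency map over hashed indices."""
    freq = {}
    for term in terms:
        # Toy, deterministic hash: sum of character codes mod num_features.
        # Two different terms can land on the same index (a collision),
        # which is why the original tokens cannot be recovered.
        idx = sum(ord(c) for c in term) % num_features
        freq[idx] = freq.get(idx, 0.0) + 1.0
    return freq

tf = hashing_tf(["spark", "spark", "idf"])
# tf maps hashed indices to counts; the index alone no longer
# identifies which term produced it.
```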

answered by zero323
  • This greatly helps @zero323, but I do not want to save it as a complex structure. The problem remains: how to extract/access (in your example) the part [261052,362890,816618] and the part [0.0,0.0,1.9252908618525775], to save them separately or to make some calculations on them. – K.Ali Feb 24 '16 at 10:20
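Regarding the two parts asked about in the comment: on a PySpark SparseVector they are exposed directly as v.indices and v.values, so they can be mapped out of the RDD separately. As a Spark-independent illustration, the sketch below splits the printed string form shown above into its size, indices, and values; split_sparse is a hypothetical helper written for this example, not a PySpark API:

```python
import re

def split_sparse(s):
    """Extract (size, indices, values) from a '(n,[i,...],[v,...])' string,
    i.e. the string form a SparseVector prints as."""
    size_part, idx_part, val_part = re.match(
        r"\((\d+),\[([^\]]*)\],\[([^\]]*)\]\)", s).groups()
    indices = [int(i) for i in idx_part.split(",")] if idx_part else []
    values = [float(v) for v in val_part.split(",")] if val_part else []
    return int(size_part), indices, values

size, indices, values = split_sparse(
    "(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])")
# indices and values are now plain Python lists that can be saved
# or processed independently.
```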