
I am calculating TF and IDF using PySpark's HashingTF and IDF with the following code:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()

# Load documents (one per line).
documents = sc.textFile("random.txt").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)

The question is: I can print tfidf to the screen using the collect() method, but how can I access specific data inside it, or save the whole tfidf vector space to an external file or DataFrame?

asked by K.Ali, edited by zero323

1 Answer


HashingTF and IDF return an RDD where each element is a pyspark.mllib.linalg.Vector (org.apache.spark.mllib.linalg.Vector in Scala). This means that:

  • you can access individual values by index. For example:

    documents = sc.textFile("README.md").map(lambda line: line.split(" "))
    tf = HashingTF().transform(documents)
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    
    v = tfidf.first()
    v
    ## SparseVector(1048576, {261052: 0.0, 362890: 0.0, 816618: 1.9253})
    
    type(v)
    ## pyspark.mllib.linalg.SparseVector
    
    v[0]
    ## 0.0
    
  • it can be saved directly to a text file. Vectors provide a meaningful string representation and a parse method that can be used to restore the original structure:

    from pyspark.mllib.linalg import Vectors
    
    tfidf.saveAsTextFile("/tmp/tfidf")
    sc.textFile("/tmp/tfidf/").map(Vectors.parse)
    
  • it can be placed in a DataFrame:

    df = tfidf.map(lambda v: (v, )).toDF(["features"])
    
    ## df.printSchema()
    ## root
    ## |-- features: vector (nullable = true)
    
    df.show(1, False)
    ## +-------------------------------------------------------------+
    ## |features                                                     |
    ## +-------------------------------------------------------------+
    ## |(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])|
    ## +-------------------------------------------------------------+
    ## only showing top 1 row
    
  • HashingTF is irreversible, so it cannot be used to extract information about specific tokens. See: How to get word details from TF Vector RDD in Spark ML Lib?
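To see why HashingTF is irreversible, consider a minimal pure-Python sketch of the hashing trick it is based on: each term is mapped to an index via a hash modulo the feature count, and distinct terms can collide on one index, so the mapping cannot be inverted to recover tokens. The toy hash below is for illustration only and is not the hash function Spark actually uses:

```python
# Sketch of the hashing trick behind HashingTF (illustration only;
# Spark's real hash function differs).

def hashing_tf(terms, num_features=16):
    """Build a term-frequency map over hashed indices."""
    freq = {}
    for term in terms:
        # Toy, deterministic hash: sum of character codes mod num_features.
        # Two different terms can land on the same index (a collision),
        # which is why the original tokens cannot be recovered.
        idx = sum(ord(c) for c in term) % num_features
        freq[idx] = freq.get(idx, 0.0) + 1.0
    return freq

tf = hashing_tf(["spark", "spark", "idf"])
# tf maps hashed indices to counts; the index alone no longer
# identifies which term produced it.
```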

answered by zero323
  • This greatly helps @zero323, but I do not want to save it as a complex structure. The problem remains: how to extract/access (in your example) the part [261052,362890,816618] and the part [0.0,0.0,1.9252908618525775], to save them separately or to make some calculations on them. – K.Ali Feb 24 '16 at 10:20
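Regarding the two parts asked about in the comment: on a PySpark SparseVector they are exposed directly as v.indices and v.values, so they can be mapped out of the RDD separately. As a Spark-independent illustration, the sketch below splits the printed string form shown above into its size, indices, and values; split_sparse is a hypothetical helper written for this example, not a PySpark API:

```python
import re

def split_sparse(s):
    """Extract (size, indices, values) from a '(n,[i,...],[v,...])' string,
    i.e. the string form a SparseVector prints as."""
    size_part, idx_part, val_part = re.match(
        r"\((\d+),\[([^\]]*)\],\[([^\]]*)\]\)", s).groups()
    indices = [int(i) for i in idx_part.split(",")] if idx_part else []
    values = [float(v) for v in val_part.split(",")] if val_part else []
    return int(size_part), indices, values

size, indices, values = split_sparse(
    "(1048576,[261052,362890,816618],[0.0,0.0,1.9252908618525775])")
# indices and values are now plain Python lists that can be saved
# or processed independently.
```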