
I need to classify a lot of products into a category tree, and I'm testing with Spark and MLlib's Naive Bayes. But I don't understand how I can calculate the TF-IDF.

I have a trainer file like this:

#filenameTrainer:
103,355 4 50 60 71 72 66 73 57 53
103,35 45 55 65 75 85 66 73 57 53
104,355 41 51 61 71 72 67 73 58 54

etc.

Where the first column is the category id, and the others are the words converted into an index.

This is the (pseudo)code that I use for training:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf()
  .setAppName("SparseNaiveBayes test")
  .setMaster("local[1]")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
val trainData = MLUtils.loadLabeledPoints(sc, filenameTrainer)
val trained: NaiveBayesModel = NaiveBayes.train(trainData)

Now, if I try to predict a category:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val testData: Vector = Vectors.dense(Array[Double](3, 35, 45, 55, 65, 75, 85, 66, 73, 92))
val result: Double = trained.predict(testData)
println("Result = " + result)

The result is correct; it returns category 103: Result = 103.0

Now the question is, how can I calculate the TF–IDF for the trainer file?
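(Not an answer, just to make the question concrete: Spark MLlib provides HashingTF and IDF in org.apache.spark.mllib.feature for this, and MLlib's IDF convention is idf(t) = log((n + 1) / (df(t) + 1)). Below is a minimal plain-Scala sketch of that formula, with no Spark, using the three sample documents from the trainer file above.)

```scala
// Sample documents: the word indices from the trainer file, as strings.
val docs: Seq[Seq[String]] = Seq(
  Seq("355", "4", "50", "60", "71", "72", "66", "73", "57", "53"),
  Seq("35", "45", "55", "65", "75", "85", "66", "73", "57", "53"),
  Seq("355", "41", "51", "61", "71", "72", "67", "73", "58", "54")
)
val n = docs.size.toDouble

// Document frequency: in how many documents each term appears.
val df: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }

// TF-IDF per document, MLlib convention: idf(t) = log((n + 1) / (df(t) + 1)).
def tfidf(doc: Seq[String]): Map[String, Double] = {
  val tf = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
  tf.map { case (t, f) => t -> f * math.log((n + 1.0) / (df(t) + 1.0)) }
}

println(tfidf(docs.head))
```

Note that a term such as "73", which appears in every document, gets a weight of log(4/4) = 0, while "355", which appears in two of three documents, gets log(4/3) ≈ 0.29.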

    possible duplicate of [How can I create a TF-IDF for Text Classification using Spark?](http://stackoverflow.com/questions/24548290/how-can-i-create-a-tf-idf-for-text-classification-using-spark) – eliasah Jul 08 '15 at 15:26
  • @eliasah yes, the question is similar; did you manage to solve it? The answers there did not solve the problem – faster2b Jul 08 '15 at 15:41
  • Did you read the answers? The given answers are a good start. I actually solved it but I didn't have time to write a proper explained answer – eliasah Jul 08 '15 at 17:12

0 Answers