I need to classify a lot of products into a category tree. I'm testing with Spark and MLlib Naive Bayes, but I don't understand how to calculate the TF-IDF.
I have a training file like this:
#filenameTrainer:
103,355 4 50 60 71 72 66 73 57 53
103,35 45 55 65 75 85 66 73 57 53
104,355 41 51 61 71 72 67 73 58 54
etc.etc.
The first column is the category id, and the remaining columns are the words, each converted into an index.
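To make the format concrete, here is a minimal sketch of how I read a single row. `ParseRowSketch` and `parseRow` are hypothetical names used only for illustration, not part of MLlib, and the sketch assumes nothing beyond the `label,f1 f2 ... fn` layout shown above.

```scala
object ParseRowSketch {
  // Splits "label,f1 f2 ... fn" into the category id and the word indices.
  // Note that the word indices end up as the feature *values* of the row.
  def parseRow(line: String): (Double, Array[Double]) = {
    val Array(label, features) = line.split(",", 2)
    (label.toDouble, features.trim.split(" ").map(_.toDouble))
  }
}
```

For example, `ParseRowSketch.parseRow("103,355 4 50 60 71 72 66 73 57 53")` yields the label `103.0` and ten feature values starting with `355.0`.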
This is the (pseudo) code that I use for training:
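import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf()
  .setAppName("SparseNaiveBayes test")
  .setMaster("local[1]")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
// Each line of the file is parsed as "label,f1 f2 ... fn"
val trainData = MLUtils.loadLabeledPoints(sc, filenameTrainer)
val trained: NaiveBayesModel = NaiveBayes.train(trainData)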
Then, if I try to predict a category:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val testData: Vector = Vectors.dense(Array[Double](3, 35, 45, 55, 65, 75, 85, 66, 73, 92))
val result: Double = trained.predict(testData)
println("Result = " + result)
The result is correct; it returns category 103: Result = 103.0
Now the question is, how can I calculate the TF–IDF for the trainer file?
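One direction that might work, as a rough sketch rather than a confirmed solution: MLlib provides `HashingTF` and `IDF` in `org.apache.spark.mllib.feature` (Spark 1.1 and later, as far as I know). The sketch assumes the word ids can be treated as opaque tokens and that hashing them again is acceptable. `TfIdfSketch`, `sampleRows` and `tfIdfTrainingSet` are hypothetical names, and the rows are the three samples from above, held in memory instead of being read from the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object TfIdfSketch {
  // The three sample rows from the question, used here only for illustration.
  val sampleRows = Seq(
    "103,355 4 50 60 71 72 66 73 57 53",
    "103,35 45 55 65 75 85 66 73 57 53",
    "104,355 41 51 61 71 72 67 73 58 54")

  def tfIdfTrainingSet(lines: RDD[String]): RDD[LabeledPoint] = {
    // Split each row into (label, word ids); the ids are kept as string tokens.
    val parsed: RDD[(Double, Seq[String])] = lines.map { line =>
      val Array(label, terms) = line.split(",", 2)
      (label.toDouble, terms.trim.split(" ").toSeq)
    }.cache()

    // Term frequencies: each token is hashed to a column of a sparse vector.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(parsed.map(_._2)).cache()

    // Inverse document frequencies are fitted on the whole corpus, then applied.
    val idfModel = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idfModel.transform(tf)

    // Re-attach the labels; zip lines up because both sides derive from `parsed` via map only.
    parsed.map(_._1).zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TF-IDF sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val training = tfIdfTrainingSet(sc.parallelize(sampleRows))
    val model = NaiveBayes.train(training)
    println("Labels seen by the model: " + model.labels.mkString(", "))
    sc.stop()
  }
}
```

If this is the right approach, I assume the vector passed to `predict` would have to go through the same `HashingTF` and the same fitted IDF model, instead of being built from raw word indices as in my test above.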