
I'm new to PySpark. I would like to perform some machine learning on a text file.

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd # convert df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

For my last command, I get the error `AttributeError: 'RDD' object has no attribute '_jdf'`.


Can anyone help me, please? Thank you.

A.Dorra
  • Post the complete traceback of the error. – Arpit Solanki Feb 26 '18 at 14:03
  • I posted a screenshot of the resulting error. Thank you. – A.Dorra Feb 26 '18 at 14:14
  • `CountVectorizer` of pyspark.**ml** works on dataframes, not on RDDs (see [examples](https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer) and [docs](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer)). – desertnaut Feb 26 '18 at 14:49
  • My input file is a text without any structure. – A.Dorra Feb 26 '18 at 15:26
  • Here is an example of my text file: "alt.atheism alt atheism faq atheist resources archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and assorted other atheist paraphernalia are available from the freedom from religion foundation in the us write to ffrf p o box madison" – A.Dorra Feb 26 '18 at 15:27

1 Answer


You shouldn't be using an RDD with `CountVectorizer`. Instead, you should form the array of words in the dataframe itself:

train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
# split each line on spaces; the first token is the label
# (note: unlike the RDD version below, "words" here still contains the label as its first element)
td = train_data.select(F.split("value", " ").alias("words")) \
    .select(F.col("words")[0].alias("label"), F.col("words"))

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

It should then work, and you can call the `transform` function:

vectorizer_transformer.transform(td).show(truncate=False)
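
As a quick sanity check (a sketch, assuming the dataframe `td` built above), you can also inspect the vocabulary the fitted model learned and the schema of the transformed dataframe:

# the learned vocabulary: one entry per distinct word kept by the model
print(vectorizer_transformer.vocabulary[:10])

# bag_of_words is a sparse vector column with one slot per vocabulary entry
vectorizer_transformer.transform(td).printSchema()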

Now, if you want to stick with converting to an RDD, then you have to modify certain lines of your code. The following is your complete code, modified so that it works:

from pyspark.sql import Row  # Row lives in pyspark.sql, not in the top-level pyspark package
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert df to rdd
# each rdd element is a Row; line[0] is the text, which we split into words
tr_data = td.map(lambda line: line[0].split(" ")) \
    .map(lambda words: Row(label=words[0], words=words[1:])) \
    .toDF()  # back to a dataframe, which is what CountVectorizer expects

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
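
As in the dataframe version, you can then call transform on it:

vectorizer_transformer.transform(tr_data).show(truncate=False)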

But I would suggest you stick with the dataframe way.
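
If your end goal is classification on this corpus, one possible continuation (a sketch, not part of the original answer; it assumes the `tr_data` dataframe built above) is to chain the vectorizer with a `StringIndexer` and a classifier in a `Pipeline`:

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, StringIndexer
from pyspark.ml.classification import NaiveBayes

# encode the string labels as numeric indices for the classifier
indexer = StringIndexer(inputCol="label", outputCol="label_index")
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
nb = NaiveBayes(featuresCol="bag_of_words", labelCol="label_index")

model = Pipeline(stages=[indexer, vectorizer, nb]).fit(tr_data)
predictions = model.transform(tr_data)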

Ramesh Maharjan