
I am trying to load a CSV as a Spark DataFrame and perform machine learning operations on it, but I am getting a Python serialization EOFError.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import RandomForestClassifier

conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read the CSV as a DataFrame
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('myfile.csv')

# convert the DataFrame into ML format (features/label columns)
r_formula = RFormula(formula="outcome ~ .")
mldf = r_formula.fit(df).transform(df)

# fit a random forest model
rf = RandomForestClassifier(numTrees=3, maxDepth=2)
model = rf.fit(mldf)
result = model.transform(mldf).head()
Prisha
    Hi @Prisha, it would be helpful if you posted the error stack trace - it would make your problem much easier to understand. It would also seem like [a very similar question](https://stackoverflow.com/questions/36561804/pyspark-serialization-eoferror) is already being discussed, perhaps it could help you. – taylorthurlow Jan 23 '19 at 01:29
  • Okay I will update my error – Prisha Jan 23 '19 at 01:42

0 Answers