I am selecting a subset of data from a larger dataframe.

dataset = df.select('RatingScore',
             'CategoryScore',
             'CouponBin',
             'TTM',
             'Price',
             'Spread',
             'Coupon', 
             'WAM', 
             'DV')

dataset = dataset.fillna(0)
dataset.show(5,True)
dataset.printSchema()

Now, I feed that into my KMeans model:

from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
import numpy as np

data_array=np.array(dataset)

#data_array =  np.array(dataset.select('RatingScore', 'CategoryScore', 'CouponBin', 'TTM', 'Price', 'Spread', 'Coupon', 'WAM', #'DV').collect())

# Build the model (cluster the data)
clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = data_array.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

This line: clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

Throws this error: AttributeError: 'numpy.ndarray' object has no attribute 'map'

From the code, you can see that I tried to create the array two different ways. Neither worked. If I try to feed in the items straight from the subset dataframe, I get this error:

AttributeError: 'DataFrame' object has no attribute 'map'

What am I missing here?

ASH
  • Neither of those objects have a `.map` attribute, or for that matter, a `reduce` – juanpa.arrivillaga Jan 29 '20 at 14:49
  • Does this answer your question? [pandas 'DataFrame' object has no attribute 'map'](https://stackoverflow.com/questions/51744786/pandas-dataframe-object-has-no-attribute-map) – AMC Feb 08 '20 at 01:05
  • https://stackoverflow.com/questions/54607989/pandas-attributeerror-dataframe-object-has-no-attribute-map/54608192, https://stackoverflow.com/questions/39535447/attributeerror-dataframe-object-has-no-attribute-map – AMC Feb 08 '20 at 01:06

1 Answer

I think there are two ways:

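  1. Convert the pandas.DataFrame into a Spark DataFrame and use its RDD (`spark_df.rdd`), as suggested in other similar questions. An RDD has both `map` and `reduce`.
  2. Convert the pandas.DataFrame into multiple pandas.Series, which have a `map` method according to the official pandas documentation.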
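
The first option can be sketched roughly as follows. This is only an illustration, not a tested fix for the exact data in the question: it assumes `dataset` is a Spark DataFrame (the `select`, `show` and `printSchema` calls suggest it is) and that every selected column is numeric. The helper names `row_to_features`, `euclidean_distance` and `cluster_dataframe` are made up for this example.

```python
from math import sqrt


def row_to_features(row):
    """Turn one row (any sequence of numbers) into a list of floats."""
    # Assumption: every column is numeric; a string column would raise here.
    return [float(value) for value in row]


def euclidean_distance(point, center):
    """Distance between a point and a cluster center of the same length."""
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))


def cluster_dataframe(dataset, k=2):
    """Train MLlib KMeans on a Spark DataFrame; return (model, total error)."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # A Spark DataFrame has no .map, but its underlying RDD of Rows does.
    features = dataset.rdd.map(row_to_features).cache()

    model = KMeans.train(features, k, maxIterations=10,
                         initializationMode="random")

    # Sum of each point's distance to its assigned center, as in the question.
    centers = model.clusterCenters
    total_error = features.map(
        lambda point: euclidean_distance(point, centers[model.predict(point)])
    ).sum()
    return model, total_error
```

With the `dataset` from the question, this would be called as `clusters, WSSSE = cluster_dataframe(dataset, k=2)`. The key change is that `KMeans.train` receives an RDD instead of a NumPy array or a DataFrame; if the data starts out as a pandas DataFrame, it would first need to go through `spark.createDataFrame(...)`.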
Hanchen