I am working on a project where I need to perform K-means clustering on a large-scale dataset using PySpark. The dataset consists of millions of rows and has thousands of feature columns. I have successfully loaded the data into a PySpark DataFrame, but I am encountering issues while trying to prepare the data for K-means clustering.
I have attempted to follow various tutorials and examples, but I keep running into errors related to column resolution, column indexing, or other transformations. The main challenge seems to be converting the feature columns into the format that K-means clustering expects. The particular file I am loading below (one part of the full dataset) has dimensions (6883, 9995), so it is very wide, and I cannot work out why the K-means fitting and prediction fail.
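To make that last point concrete, the conversion I believe is needed is casting every feature column to a numeric type before assembling it into a vector, since the file is read without schema inference. A rough sketch of what I mean, applied to the bgc DataFrame that is loaded in the code below and assuming every column holds numeric values stored as strings:

from pyspark.sql.functions import col

# cast every column to double so VectorAssembler receives numeric inputs
# (assumes all values are numeric strings; anything non-numeric becomes null)
bgc_numeric = bgc.select([col(c).cast("double").alias(c) for c in bgc.columns])

I am not sure whether this is the right approach, which is part of what I am asking.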
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.getActiveSession()

# read options: tab-delimited file with a header row, no schema inference
file_location = "dbfs:/user/dorwayne/bgc_features_part0001.tsv"
infer_schema = "false"
first_row_is_header = "true"
delimiter = "\t"

bgc = (spark.read.format("csv")
       .option("inferSchema", infer_schema)
       .option("header", first_row_is_header)
       .option("delimiter", delimiter)
       .load(file_location))
# assemble all columns into a single "features" vector column
assembler = VectorAssembler(inputCols=bgc.columns, outputCol="features")
assembled_data = assembler.transform(bgc)

# cluster the assembled vectors into two groups
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(assembled_data)
predictions = model.transform(assembled_data)
However, the fit step fails with a long error:
Py4JJavaError: An error occurred while calling o8473.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 43.0 failed 4 times, most recent failure: Lost task 2.3 in stage 43.0 (TID 147) (ip-10-131-129-26.ec2.internal executor driver): java.lang.AssertionError: assertion failed
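In case the column types are relevant (inferSchema is off in the read above), I can check and share them with something like the following (output omitted here):

# show the type Spark assigned to each column when reading the TSV
bgc.printSchema()

# quick look at the first few column name/type pairs
print(bgc.dtypes[:5])

Any pointers on how to prepare these columns correctly for KMeans would be appreciated.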