
I am working on a project where I need to perform K-means clustering on a large-scale dataset using PySpark. The dataset consists of millions of rows and has thousands of feature columns. I have successfully loaded the data into a PySpark DataFrame, but I am encountering issues while trying to prepare the data for K-means clustering.

I have attempted to follow various tutorials and examples, but I keep running into errors related to column resolution, column indexing, or other transformations. The main challenge seems to be converting the feature columns into the format required for K-means clustering. The dimensions of the data are (6883, 9995). As you can see, it is a very large dataset, and I am perplexed as to why the K-means fitting and prediction fail.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.getActiveSession()

file_location = "dbfs:/user/dorwayne/bgc_features_part0001.tsv"

bgc = (spark.read.format("csv")
       .option("inferSchema", "false")
       .option("delimiter", "\t")
       .option("header", "true")
       .load(file_location))

assembler = VectorAssembler(inputCols=bgc.columns, outputCol="features")
assembled_data = assembler.transform(bgc)
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(assembled_data)
predictions = model.transform(assembled_data)

However, the output is a large error:

Py4JJavaError: An error occurred while calling o8473.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 43.0 failed 4 times, most recent failure: Lost task 2.3 in stage 43.0 (TID 147) (ip-10-131-129-26.ec2.internal executor driver): java.lang.AssertionError: assertion failed
werner
Spidey1610

1 Answer

When reading the TSV file, the option inferSchema is set to false. Since no schema is supplied, all fields in the DataFrame bgc have type string, but the VectorAssembler does not accept string columns (docs).

So you can either set inferSchema to true (which will take some time for large datasets) or supply a schema when reading the file. For details, please check the second part of this answer.

werner