I wanted to know if there is any way to oversample data using PySpark.
I have a dataset with a target variable of 10 classes. As of now I am taking each class and oversampling it as below to match the majority class:
from pyspark.sql import functions as F

# split the DataFrame into one subset per class
transformed_04 = transformed.where(F.col('nps_score') == 4)
transformed_03 = transformed.where(F.col('nps_score') == 3)
transformed_02 = transformed.where(F.col('nps_score') == 2)
transformed_01 = transformed.where(F.col('nps_score') == 1)
transformed_00 = transformed.where(F.col('nps_score') == 0)

# oversample each class with replacement (fractions chosen by hand, seed = 9)
transformed_04_more_rows = transformed_04.sample(True, 11.3, 9)
transformed_03_more_rows = transformed_03.sample(True, 16.3, 9)
transformed_02_more_rows = transformed_02.sample(True, 12.0, 9)
And finally I combine all the DataFrames with unionAll:
transformed_04_more_rows.unionAll(transformed_03_more_rows).unionAll(transformed_02_more_rows)
I am working out the sampling fractions manually. For example, if the 4th class has 2000 rows and the second class has 10 rows, I check the counts by hand and pass fractions such as 16 and 12 accordingly, as in the code above.
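To make clearer what I would like to automate, here is a rough, untested sketch that computes each fraction from the class counts instead of by hand (the names counts, majority, and oversampled are just illustrative, not from my actual code):

from pyspark.sql import functions as F

# count rows per class, then sample every class up to the majority count
counts = {row['nps_score']: row['count']
          for row in transformed.groupBy('nps_score').count().collect()}
majority = max(counts.values())

oversampled = None
for label, cnt in counts.items():
    fraction = majority / cnt                        # ratio to the largest class
    subset = transformed.where(F.col('nps_score') == label)
    # sampling with replacement; a fraction > 1 duplicates rows on average
    resampled = subset.sample(True, fraction, 9)
    oversampled = resampled if oversampled is None else oversampled.unionAll(resampled)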
Forgive me that the code above is not complete; I only included it to give a view of my approach. What I want to know is whether there is any automated way to do this, like SMOTE, in PySpark.
I have seen the question linked below: Oversampling or SMOTE in Pyspark
It says the target class has to be binary. If I remove that condition, it throws some datatype errors.
Can anyone help me with this implementation in PySpark? Checking every class and providing the sampling values manually is very painful, please help.