
I wanted to know if there is any way to oversample the data using pyspark.

I have a dataset with a target variable of 10 classes. As of now I am taking each class and oversampling it like below to match the majority class:

transformed_04=transformed.where(F.col('nps_score')==4)
transformed_03=transformed.where(F.col('nps_score')==3)
transformed_02=transformed.where(F.col('nps_score')==2)
transformed_01=transformed.where(F.col('nps_score')==1)
transformed_00=transformed.where(F.col('nps_score')==0)

# sample(withReplacement, fraction, seed); with replacement, a fraction > 1 oversamples
transformed_04_more_rows=transformed_04.sample(True,11.3,9)
transformed_03_more_rows=transformed_03.sample(True,16.3,9)
transformed_02_more_rows=transformed_02.sample(True,12,9)

And finally I am joining all the DataFrames with unionAll:

transformed_04_more_rows.unionAll(transformed_03_more_rows).unionAll(transformed_02_more_rows)

I am checking the sampling values manually. For example, if the 4th class has 2000 rows and the 2nd class has 10 rows, I check the counts by hand and provide values like 16 and 12 accordingly, as in the code above.
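
Just to show what I mean by checking manually, here is a rough sketch of how those counts, and hence the fractions, could be computed from the data itself (transformed and nps_score are the names from the snippet above; matching the largest class is only an assumption):

# per-class row counts that the manual fractions above are based on
counts = {row['nps_score']: row['count']
          for row in transformed.groupBy('nps_score').count().collect()}

# e.g. oversample every class up to (roughly) the size of the largest one
majority = max(counts.values())
fractions = {label: float(majority) / n for label, n in counts.items()}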

Forgive me that the code above is not complete; I put it there just to give a view of what I am doing. I wanted to know if there is any automated way, like SMOTE, in PySpark.

I have seen the link below: Oversampling or SMOTE in Pyspark

It says the target class has to have only two values. If I remove that condition it throws me some datatype issues.

Can anyone help me with this implementation in PySpark? Checking every class and providing sampling values manually is very painful, please help.

Naveen Srikanth

1 Answer


Check out the sampleBy function of Spark; this enables stratified sampling. https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=sampleby#pyspark.sql.DataFrame.sampleBy

In your case, for each class you can provide the fraction of the sample that you want in a dictionary and use it in sampleBy; try it out. To decide the fractions, you can do an aggregation count on your target column, normalize to (0, 1), and tune it.
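
For illustration, a rough sketch of what this could look like, assuming the nps_score column from the question; since sampleBy fractions must lie in [0, 1], this sketch balances classes by keeping roughly the size of the smallest class rather than truly oversampling:

# count rows per class on the target column
counts = {row['nps_score']: row['count']
          for row in transformed.groupBy('nps_score').count().collect()}

# normalize into (0, 1]: here every class keeps roughly as many rows as the
# smallest class; tune this mapping to whatever balance you need
smallest = min(counts.values())
fractions = {label: float(smallest) / n for label, n in counts.items()}

balanced = transformed.sampleBy('nps_score', fractions=fractions, seed=9)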

Raghu
  • Thanks Raghu :). sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0) Here they say for class/key 0 the fraction is 0.1 and for class/key 1 it is 0.2, so how should I interpret this as a percentage? If I need to repeat class 10 twenty times, should I give 0.2 or 2.0? – Naveen Srikanth Jul 02 '20 at 18:51
  • Ah, now I get it. Presently PySpark supports sampling fractions only in the range (0, 1). Just a hack: sample two times (maybe in a loop) with 1 for the needed class and 0 for the others, and union them (a rough sketch of this is given after these comments). BTW, did the IndexToString method work out? – Raghu Jul 02 '20 at 18:56
  • Actually I got stuck there as well. I am using random forest for feature importance; during this phase the label index was formed using StringIndexer. After getting the feature importance I am building an lr model. During this phase, when I tried to build the pipeline passing label_indexer as a parameter to the pipeline, it was throwing an error that the label column is already present – Naveen Srikanth Jul 02 '20 at 19:01
  • "Just a hack: sample two times (maybe in a loop) with 1 for the needed class and 0 for the others, and union them. BTW, did the IndexToString method work out." This one I didn't understand. Should I check by brute force with 0.2/0.1 values to match the higher count? – Naveen Srikanth Jul 02 '20 at 19:02
  • For StringIndexer, check the value of outputCol. If this column is already present in the input data, then you get that error. For sampleBy, when you give 1 it means it chooses all the data with that class. Say you have 10 samples with label class 10; then giving 0.5 will choose about 5 samples. – Raghu Jul 02 '20 at 19:18
  • Can you please tell me how to remove duplicate columns? There are lots of links but I am not finding a proper one. I have a DataFrame with columns a, b, b and I need to retain only one b column, either the first or the second one – Naveen Srikanth Jul 04 '20 at 22:24
  • df.select(list(set(df.columns))) – Raghu Jul 05 '20 at 02:06
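
A rough sketch of the loop-and-union hack mentioned in the comments above; the class label 3, the two extra copies, and the seeds are illustrative assumptions only:

from functools import reduce

# fraction 1.0 keeps every row of the wanted class; classes that are not
# listed in the fractions dict are treated as 0.0 and dropped
extra_copies = [transformed.sampleBy('nps_score', fractions={3: 1.0}, seed=i)
                for i in range(2)]

# append the extra copies of the minority class to the original data
oversampled = reduce(lambda a, b: a.unionAll(b), extra_copies, transformed)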