My data is the diamonds dataset:
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|carat| cut|color|clarity|depth|table|price| x| y| z|
+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
I have created a function that reads the carat column and returns an interval for each value. I need to add a new column containing these intervals.
Result should be like:
carat carat_bin
0.23 (0,1)
1.5 (1,2)
My code so far is:
def carat_bin(size):
    if ((df['size'] >0) & (df['size'] <= 1)):
        return '[0,1)'
    if ((df['size'] >1) & (df['size'] <= 2)):
        return '[1,2)'
    if ((df['size'] >2) & (df['size'] <= 3)):
        return '[2,3)'
    if ((df['size'] >3) & (df['size'] <= 4)):
        return '[3,4)'
    if ((df['size'] >4) & (df['size'] <= 5)):
        return '[4,5)'
    elif df['size'] :
        return '[5, 6)'
spark.udf.register('carat_bin', carat_bin)
tst = diamonds.withColumn("carat_bin", carat_bin(diamonds['carat']))
But what I get is:
Cannot resolve column name "size" among (carat, cut, color, clarity, depth, table, price, x, y, z);
What am I missing here?
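For reference, here is a minimal sketch of what the function probably needs to look like. The error message suggests that `df['size']` is being read as a lookup of a column literally named "size", whereas a UDF receives one plain value per row, so the parameter `size` can be compared directly. The sketch assumes half-open intervals `[n, n+1)` to match the labels, treats everything from 5 upward as `[5,6)`, and passes nulls through unchanged. The helper `add_carat_bin` is a hypothetical name used only for illustration; it wraps the function with `pyspark.sql.functions.udf` so it can be applied to a column.

```python
def carat_bin(size):
    """Return a label for the unit interval [n, n+1) that contains `size`."""
    if size is None:
        # Assumption: keep nulls as nulls rather than binning them.
        return None
    for lower in range(5):
        if lower <= size < lower + 1:
            return f"[{lower},{lower + 1})"
    # Catch-all, mirroring the last branch of the code in the question.
    return "[5,6)"


def add_carat_bin(diamonds):
    """Hypothetical helper: add a carat_bin column to a Spark DataFrame."""
    # Imported here so the plain function above can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    carat_bin_udf = udf(carat_bin, StringType())
    return diamonds.withColumn("carat_bin", carat_bin_udf(diamonds["carat"]))
```

With this, `tst = add_carat_bin(diamonds)` would replace the `withColumn` call above. Because `carat_bin` is an ordinary Python function, it can also be checked on single values such as `carat_bin(0.23)` before involving Spark at all.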