
I have a dataframe as below:

+------+--------------+
|   sid|first_term_gpa|
+------+--------------+
|100170|           2.0|
|100446|        3.8333|
|100884|           2.0|
|101055|           3.0|
|101094|        3.7333|
|101775|        3.7647|
|102524|        3.8235|
|102798|           3.5|
|102960|        2.8235|
|103357|           3.0|
|103747|        3.8571|
|103902|           3.8|
|104053|        3.1667|
|104064|        1.8462|
+------+--------------+

and I have created a UDF:

def student_gpa(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[float(gpa)]

where the parameter `gpa` is expected to be a float.

I apply the UDF created above to the first_term_gpa column to create a new column named gpa_bin with the code below:

alumni_ft_gpa = first_term_gpa \
.withColumn('gpa_bin', expr('student_gpa(first_term_gpa)'))\
.show()

but it throws an error:

An exception was thrown from a UDF: 'TypeError: list indices must be integers or slices, not float'

What am I missing here?


1 Answer


Using imports

Here is a working solution that builds upon your attempt:

from pyspark.sql import Row, functions as F
from pyspark.sql.types import StringType   


df = spark.createDataFrame([
    Row(sid=100170, first_term_gpa=2.0),
    Row(sid=100446, first_term_gpa=3.8333),
    Row(sid=100884, first_term_gpa=2.0),
    Row(sid=101055, first_term_gpa=3.0),
    Row(sid=101094, first_term_gpa=3.7333),
    Row(sid=101775, first_term_gpa=3.7647),
    Row(sid=102524, first_term_gpa=3.8235),
    Row(sid=102798, first_term_gpa=3.5),
    Row(sid=102960, first_term_gpa=2.8235),
    Row(sid=103357, first_term_gpa=3.0),
    Row(sid=103747, first_term_gpa=3.8571),
    Row(sid=103902, first_term_gpa=3.8),
    Row(sid=104053, first_term_gpa=3.1667),
    Row(sid=104064, first_term_gpa=1.8462),
])

@F.udf(StringType())
def student_gpa(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[int(gpa)]

df \
   .withColumn('gpa_bin', student_gpa('first_term_gpa'))\
   .show()

which outputs:

+------+--------------+-------+
|   sid|first_term_gpa|gpa_bin|
+------+--------------+-------+
|100170|           2.0|  [2,3)|
|100446|        3.8333|  [3,4)|
|100884|           2.0|  [2,3)|
|101055|           3.0|  [3,4)|
|101094|        3.7333|  [3,4)|
|101775|        3.7647|  [3,4)|
|102524|        3.8235|  [3,4)|
|102798|           3.5|  [3,4)|
|102960|        2.8235|  [2,3)|
|103357|           3.0|  [3,4)|
|103747|        3.8571|  [3,4)|
|103902|           3.8|  [3,4)|
|104053|        3.1667|  [3,4)|
|104064|        1.8462|  [1,2)|
+------+--------------+-------+

The reason I convert `gpa` to an integer is tied to how the intervals are built: each bin spans exactly one unit, so the integer part of the GPA is the index into the bins list. E.g. gpa=2.5 should land in bin [2,3), which is index 2 in the bins list, and casting 2.5 to an integer yields exactly 2.
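The original error can be reproduced in plain Python, which also shows why `int()` fixes it: list indices must be integers, and truncating the float gives exactly the bin index:

```python
bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']

# Indexing with a float raises the same TypeError the UDF reported
try:
    bins[2.5]
except TypeError as e:
    print(e)  # list indices must be integers or slices, not float

# int() truncates toward zero, so int(2.5) == 2 selects the '[2,3)' bin
print(bins[int(2.5)])  # [2,3)
```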

Using expr only

from pyspark.sql.functions import expr

def student_gpa2(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[int(gpa)]

spark.udf.register("student_gpa2", student_gpa2)
df.withColumn('new_col', expr("student_gpa2(first_term_gpa)")).show()
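If even registering a Python UDF is off the table, the same binning can be expressed with built-in SQL functions alone, since each bin spans exactly one unit. This is a sketch, not from the original answer; the pure-Python function below illustrates the logic of the Spark expression shown in the comments:

```python
import math

# In Spark (a sketch, assuming only built-in SQL functions are allowed):
#   df.withColumn('gpa_bin', expr(
#       "concat('[', cast(floor(first_term_gpa) as int), ',', "
#       "cast(floor(first_term_gpa) as int) + 1, ')')"))
#
# Pure-Python equivalent of that expression, for illustration:
def gpa_bin(gpa):
    lo = math.floor(gpa)  # each bin spans one unit, so floor gives the lower bound
    return f'[{lo},{lo + 1})'

print(gpa_bin(2.8235))  # [2,3)
print(gpa_bin(3.8333))  # [3,4)
```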
  • with your code replicated I get invalid literal for int() with base 10: 'first_term_gpa' – user1997567 Jun 03 '21 at 10:35
    I updated how you can create the df. If you still get errors, can you please paste your complete stacktrace? Also please show the schema of your dataframe – pythonic833 Jun 03 '21 at 10:42
  • Can I use the same code without importing from pyspark.sql import Row, functions as F from pyspark.sql.types import StringType ? – user1997567 Jun 03 '21 at 10:43
    No, you can't. These imports are needed – pythonic833 Jun 03 '21 at 10:44
  • If I have this restriction any way around with aggregation probably? – user1997567 Jun 03 '21 at 10:45
  • The requirement for udf function is that parameter is expected to be a float representing a student’s GPA and function should return a string representing the specific GPA range that gpa falls into. – user1997567 Jun 03 '21 at 10:51
    But that is already the case: the udf is accepting a float, then returning a string. I've updated my solution to show how to do this using `expr`. It would be appreciated if you could make these restrictions clear in your question next time. – pythonic833 Jun 03 '21 at 10:53