
I have a dataframe as below:

+------+--------------+
|   sid|first_term_gpa|
+------+--------------+
|100170|           2.0|
|100446|        3.8333|
|100884|           2.0|
|101055|           3.0|
|101094|        3.7333|
|101775|        3.7647|
|102524|        3.8235|
|102798|           3.5|
|102960|        2.8235|
|103357|           3.0|
|103747|        3.8571|
|103902|           3.8|
|104053|        3.1667|
|104064|        1.8462|
+------+--------------+

and I have created a UDF:

def student_gpa(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[float(gpa)]

where the parameter `gpa` is expected to be a float.

I apply the UDF created above to the first_term_gpa column to create a new column named gpa_bin with the code below:

alumni_ft_gpa = first_term_gpa \
.withColumn('gpa_bin', expr('student_gpa(first_term_gpa)'))\
.show()

but it throws an error:

An exception was thrown from a UDF: 'TypeError: list indices must be integers or slices, not float'

What am I missing here?


1 Answer


Using imports

Here is a working solution that builds upon your attempt:

from pyspark.sql import Row, functions as F
from pyspark.sql.types import StringType   


df = spark.createDataFrame([
    Row(sid=100170, first_term_gpa=2.0),
    Row(sid=100446, first_term_gpa=3.8333),
    Row(sid=100884, first_term_gpa=2.0),
    Row(sid=101055, first_term_gpa=3.0),
    Row(sid=101094, first_term_gpa=3.7333),
    Row(sid=101775, first_term_gpa=3.7647),
    Row(sid=102524, first_term_gpa=3.8235),
    Row(sid=102798, first_term_gpa=3.5),
    Row(sid=102960, first_term_gpa=2.8235),
    Row(sid=103357, first_term_gpa=3.0),
    Row(sid=103747, first_term_gpa=3.8571),
    Row(sid=103902, first_term_gpa=3.8),
    Row(sid=104053, first_term_gpa=3.1667),
    Row(sid=104064, first_term_gpa=1.8462),
])

@F.udf(StringType())
def student_gpa(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[int(gpa)]

df \
   .withColumn('gpa_bin', student_gpa('first_term_gpa'))\
   .show()

which outputs:

+------+--------------+-------+
|   sid|first_term_gpa|gpa_bin|
+------+--------------+-------+
|100170|           2.0|  [2,3)|
|100446|        3.8333|  [3,4)|
|100884|           2.0|  [2,3)|
|101055|           3.0|  [3,4)|
|101094|        3.7333|  [3,4)|
|101775|        3.7647|  [3,4)|
|102524|        3.8235|  [3,4)|
|102798|           3.5|  [3,4)|
|102960|        2.8235|  [2,3)|
|103357|           3.0|  [3,4)|
|103747|        3.8571|  [3,4)|
|103902|           3.8|  [3,4)|
|104053|        3.1667|  [3,4)|
|104064|        1.8462|  [1,2)|
+------+--------------+-------+

The reason I convert `gpa` to an integer is tied to how the intervals are built: each bin spans exactly one unit, so the integer part of the GPA is the index into the bins list. E.g. gpa=2.5 should land in bin [2,3), which is index 2 in the bins list, and casting 2.5 to an integer yields exactly 2.
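The original error can be reproduced in plain Python, which also shows why `int()` fixes it: list indices must be integers, and truncating the float gives exactly the bin index:

```python
bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']

# Indexing with a float raises the same TypeError the UDF reported
try:
    bins[2.5]
except TypeError as e:
    print(e)  # list indices must be integers or slices, not float

# int() truncates toward zero, so int(2.5) == 2 selects the '[2,3)' bin
print(bins[int(2.5)])  # [2,3)
```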

Using expr only

from pyspark.sql.functions import expr

def student_gpa2(gpa):
    bins = ['[0,1)', '[1,2)', '[2,3)', '[3,4)']
    return bins[int(gpa)]

spark.udf.register("student_gpa2", student_gpa2)
df.withColumn('new_col', expr("student_gpa2(first_term_gpa)")).show()
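If even registering a Python UDF is off the table, the same binning can be expressed with built-in SQL functions alone, since each bin spans exactly one unit. This is a sketch, not from the original answer; the pure-Python function below illustrates the logic of the Spark expression shown in the comments:

```python
import math

# In Spark (a sketch, assuming only built-in SQL functions are allowed):
#   df.withColumn('gpa_bin', expr(
#       "concat('[', cast(floor(first_term_gpa) as int), ',', "
#       "cast(floor(first_term_gpa) as int) + 1, ')')"))
#
# Pure-Python equivalent of that expression, for illustration:
def gpa_bin(gpa):
    lo = math.floor(gpa)  # each bin spans one unit, so floor gives the lower bound
    return f'[{lo},{lo + 1})'

print(gpa_bin(2.8235))  # [2,3)
print(gpa_bin(3.8333))  # [3,4)
```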
  • with your code replicated I get invalid literal for int() with base 10: 'first_term_gpa' – user1997567 Jun 03 '21 at 10:35
    I updated how you can create the df. If you still get errors, can you please paste your complete stacktrace? Also please show the schema of your dataframe – pythonic833 Jun 03 '21 at 10:42
  • Can I use the same code without importing from pyspark.sql import Row, functions as F from pyspark.sql.types import StringType ? – user1997567 Jun 03 '21 at 10:43
    No, you can't. These imports are needed – pythonic833 Jun 03 '21 at 10:44
  • If I have this restriction any way around with aggregation probably? – user1997567 Jun 03 '21 at 10:45
  • The requirement for udf function is that parameter is expected to be a float representing a student’s GPA and function should return a string representing the specific GPA range that gpa falls into. – user1997567 Jun 03 '21 at 10:51
    But that is already the case: the udf is accepting a float, then returning a string. I've updated my solution to show how to do this using `expr`. It would be appreciated if you could make these restrictions clear in your question next time. – pythonic833 Jun 03 '21 at 10:53