8

I am relatively new to Spark and I've run into an issue when I try to use Python's built-in round() function after importing PySpark functions. It seems to be related to how I import the PySpark functions, but I am not sure what the difference is or why one way causes issues and the other doesn't.

Expected behavior:

import pyspark.sql.functions
print(round(3.14159265359,2))
>>> 3.14

Unexpected behavior:

from pyspark.sql.functions import *
print(round(3.14159265359,2))
>>> ERROR

AttributeError                            Traceback (most recent call last)
<ipython-input-1-50155ca4fa82> in <module>()
      1 from pyspark.sql.functions import *
----> 2 print(round(3.1454848383,2))

/opt/spark/python/pyspark/sql/functions.py in round(col, scale)
    503     """
    504     sc = SparkContext._active_spark_context
--> 505     return Column(sc._jvm.functions.round(_to_java_column(col), scale))
    506 
    507 

AttributeError: 'NoneType' object has no attribute '_jvm'
Mark Griggs
  • I think you need to initialize a Spark context before using this function. In the error, 'NoneType' object has no attribute '_jvm' ==> sc is NoneType ==> SparkContext does not exist. The first case works because it still uses the native round function; if you want to use the PySpark function you would have to call pyspark.sql.functions.round(3.14159265359, 2) – Tony Pellerin Sep 28 '18 at 14:53
  • But that is the thing, I don't want to use the pyspark round function. If I do the 'from pyspark.sql.functions import *', it is almost as if pyspark is overloading the round() function...? – Mark Griggs Sep 28 '18 at 14:58
  • @MGriggs *"it is almost as if pyspark is overloading the round() function...?"* That's **exactly** what it's doing. How else do you expect `import *` to work? Please read [why is `import *` bad?](https://stackoverflow.com/questions/2386714/why-is-import-bad). – pault Sep 28 '18 at 18:10
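
As the comments explain, pyspark.sql.functions.round needs an active SparkContext before it can be called. A minimal sketch of starting one via the SparkSession entry point (the column alias here is just illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# getOrCreate() starts a SparkSession, which also creates the active
# SparkContext that pyspark.sql.functions.round needs behind the scenes
spark = SparkSession.builder.getOrCreate()

# prints a one-row table containing 3.14
spark.range(1).select(F.round(F.lit(3.14159265359), 2).alias("rounded")).show()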

5 Answers

8

Import pyspark.sql.functions as F to avoid the conflict.

That way, you can use all Python built-in functions normally, and when you want a PySpark function, call it with the prefix, e.g. F.round.
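
A minimal sketch of that pattern (the example DataFrame and its value column are made up for illustration):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3.14159265359,)], ["value"])

print(round(3.14159265359, 2))         # Python's built-in round -> 3.14
df.select(F.round("value", 2)).show()  # PySpark's round, applied to a column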

mayank agrawal
  • I put an answer below about how to use a DataFrame. After seeing this answer I realize I misunderstood the question. Importing functions as F is common practice. You can also reach Python's round with `__builtin__.round()`, but that's ugly. Deleting my bad answer. – Michael West Sep 28 '18 at 15:15
  • OP mentions specifically in his comment that he doesn't want to use PySpark functions. True, that does make it ugly, and I think it is better to go with the common practice :) – mayank agrawal Sep 28 '18 at 15:19
5

Don't do `import *`, as it can mess up your namespace.

PySpark has its own round function: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.round

So the built-in round is being shadowed by pyspark.sql.functions.round.
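
You can see the shadowing directly, without even starting a SparkContext (a quick sketch; importing is fine, only calling the PySpark function requires a context):

print(round)  # <built-in function round>

from pyspark.sql.functions import round
print(round)  # now the PySpark function -- the built-in name is shadowed

del round     # removing the module-level binding restores the built-in
print(round(3.14159265359, 2))  # 3.14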

Adelina
2

If you have a long piece of code where you have used pyspark.sql.functions without a prefix like F, then in order to use Python's round exclusively, you can import it from the builtins module in your PySpark code. @michael_west was almost right, but the module is `builtins` (Python 3), not `__builtin__` (Python 2). Example code:

from builtins import round  # rebinds the name to the built-in, shadowing any star-imported PySpark round

k = round(123.456)  # k == 123
Ruli
  • I was about to head down the path of fixing all the places where someone else's code uses `pyspark.sql.functions.round` so I can use the built-in `round`, but had forgotten about the existence of the `builtins` module. On a deadline, and pressed for time, it is way easier to `import builtins` and later call `builtins.round`, so thank you for this answer! – hlongmore Sep 29 '20 at 08:22
0

Same issue here, and I don't want to alias all of pyspark.sql.functions, so I just alias round itself:

from pyspark.sql.functions import round as sqlround
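
A quick usage sketch with the alias (the DataFrame and its amount column are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import round as sqlround

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2.71828,)], ["amount"])

print(round(3.14159265359, 2))           # built-in round is untouched -> 3.14
df.select(sqlround("amount", 2)).show()  # PySpark's round under its alias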
0

To resolve the issue, use the following approach:

import pyspark.sql.functions as func

Here, the func prefix tells Python to use the round function from pyspark.sql.functions rather than the built-in one.

Sample usage below:

display(df_film.withColumn('HourLength', func.round(df_film['Length'] / 60, 2)))
Aditya Rathi