How to handle exceptions in PySpark for data science problems

Question

How can I identify which kind of exception the column-renaming function below will raise, and how do I handle it in PySpark?

def rename_columnsName(df, columns):   # provide names in dictionary format
    if isinstance(columns, dict):
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df.show()
    else:
        raise ValueError("'columns' should be a dict, like {'old_name':'new_name', 'old_name_one more':'new_name_1'}")

How can I test it by triggering an exception with a dataset?

What kind of handling do you want to do? Maybe you can check before calling withColumnRenamed if the column exists? This will allow you to do required handling for negative cases and handle those cases separately. — UtkarshSahu, Aug 16 '20 at 11:32
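
The check suggested in this comment could be sketched roughly as follows. This is only an illustration, not the asker's code: the helper name `rename_existing_columns` and the choice of `KeyError` for a missing column are assumptions. The up-front check matters because `withColumnRenamed` is a no-op when the schema does not contain the old name, so a misspelled column name never raises on its own.

```python
def rename_existing_columns(df, columns):
    """Rename columns, failing loudly when an old name is not in the DataFrame.

    Hypothetical helper: it relies only on `df.columns` and
    `df.withColumnRenamed`, both part of the PySpark DataFrame API.
    """
    if not isinstance(columns, dict):
        raise ValueError("'columns' should be a dict, like {'old_name': 'new_name'}")

    # withColumnRenamed silently ignores unknown columns, so check first.
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise KeyError(f"columns not found in DataFrame: {missing}")

    for old_name, new_name in columns.items():
        df = df.withColumnRenamed(old_name, new_name)
    # Return the DataFrame itself so the caller decides whether to show() it.
    return df
```

With this shape, the negative cases (wrong argument type, unknown column) surface as distinct exception types that the caller can handle separately.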

score 0 · Answer 1 · answered Jul 31 '20 at 12:29

Here's an example of how to test a PySpark function that throws an exception. In this example, we're verifying that an exception is thrown if the sort order is "cats".

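import pytest
import quinn
from pyspark.sql.types import StringType

# `spark` is assumed to be a pytest fixture that provides a SparkSession.
# `create_df` is not part of core PySpark; it is assumed to come from quinn's SparkSession extensions.
def it_throws_an_error_if_the_sort_order_is_invalid(spark):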
    source_df = spark.create_df(
        [
            ("jose", "oak", "switch"),
            ("li", "redwood", "xbox"),
            ("luisa", "maple", "ps4"),
        ],
        [
            ("name", StringType(), True),
            ("tree", StringType(), True),
            ("gaming_system", StringType(), True),
        ]
    )
    with pytest.raises(ValueError) as excinfo:
        quinn.sort_columns(source_df, "cats")
    assert excinfo.value.args[0] == "['asc', 'desc'] are the only valid sort orders and you entered a sort order of 'cats'"

Notice that the test is verifying the specific error message that's being provided.

You can provide invalid input to your rename_columnsName function and validate that the error message is what you expect.
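
Applied to the function from the question, such a test might look like the sketch below. The test name is hypothetical, and passing `None` in place of a DataFrame is an assumption that works only because the type check runs before the DataFrame is touched.

```python
import pytest


def rename_columnsName(df, columns):
    # Same logic as the function in the question, with indentation restored.
    if isinstance(columns, dict):
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df.show()
    else:
        raise ValueError(
            "'columns' should be a dict, like "
            "{'old_name':'new_name', 'old_name_one more':'new_name_1'}"
        )


def test_rename_columnsName_rejects_non_dict():
    # A list is not a dict, so the function should raise before using df.
    # No Spark session is needed for this negative case.
    with pytest.raises(ValueError) as excinfo:
        rename_columnsName(None, ["old_name", "new_name"])
    assert excinfo.value.args[0].startswith("'columns' should be a dict")
```

Comparing only the start of the message keeps the test stable if the example dict in the error text changes; compare the full string if you want the stricter check used in the quinn example.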

Some other tips:

Follow the examples to rename columns here and here. You shouldn't call withColumnRenamed in a loop.
Write DataFrame transformations in the standard transform format so they can be chained with DataFrame#transform.
Use pytest-describe to organize these types of tests.
Check out this test file for a bunch of examples.
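
The first two tips can be combined in one sketch. This is not necessarily what the linked examples do: the helper name `with_renamed_columns` is hypothetical, and using `toDF` to apply all renames in a single call is one assumed way of avoiding the `withColumnRenamed` loop. Chaining with `DataFrame.transform` assumes PySpark 3.0 or later.

```python
def with_renamed_columns(columns):
    """Return a DataFrame -> DataFrame function that renames columns.

    Hypothetical helper in the "transform format": the outer function takes
    the arguments, the inner one takes the DataFrame. Old names that are not
    present in the DataFrame are simply left out of the rename.
    """
    if not isinstance(columns, dict):
        raise ValueError("'columns' should be a dict, like {'old_name': 'new_name'}")

    def inner(df):
        # toDF takes the complete list of new column names, so every rename
        # happens in one call instead of one withColumnRenamed per column.
        return df.toDF(*[columns.get(name, name) for name in df.columns])

    return inner
```

A call site would then read `df.transform(with_renamed_columns({"name": "first_name"}))`, and further transformations can be chained with additional `.transform(...)` calls.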

score 0 · Accepted Answer · answered Aug 16 '20 at 08:46

I found a solution to this question: we can handle exceptions in PySpark much as we do in plain Python. For example:

def rename_columnsName(df, columns):  # provide names in dictionary format
    try:
        if isinstance(columns, dict):
            for old_name, new_name in columns.items():
                df = df.withColumnRenamed(old_name, new_name)
            return df.show()
        else:
            raise ValueError("'columns' should be a dict, like {'old_name':'new_name', 'old_name_one more':'new_name_1'}")
    except Exception as e:
        print(e)

answered Aug 16 '20 at 08:46

Gamefic


AFAIK, we should not add `Exception` in the `except` clause. If we remove the outer try clause, this function would be easier to read and understand. Thus, we should handle the exceptions when calling this function. What would be the point in raising an error if we don't do that? – DRTorresRuiz May 31 '23 at 13:37
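
The structure this comment describes might look like the sketch below. The names `rename_columns` and `rename_or_keep` are hypothetical, and falling back to the unrenamed DataFrame is just one assumed way a caller could react.

```python
import logging

logger = logging.getLogger(__name__)


def rename_columns(df, columns):
    # No try/except here: invalid input propagates to the caller.
    if not isinstance(columns, dict):
        raise ValueError("'columns' should be a dict, like {'old_name': 'new_name'}")
    for old_name, new_name in columns.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df


def rename_or_keep(df, columns):
    """Hypothetical call site: catch only the error it knows how to handle."""
    try:
        return rename_columns(df, columns)
    except ValueError as exc:
        logger.warning("skipping rename: %s", exc)
        return df  # fall back to the unrenamed DataFrame
```

Catching `ValueError` rather than a bare `Exception` means unrelated failures still surface instead of being printed and swallowed.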
