
I have a Spark dataframe that contains 4 columns:

(col_1, col_2, col_3, col_4) ==> (String, String, Int, Int)

In the data, sometimes col_3 is empty, for example:

 col_1|col_2|col_3|col_4
 col_1|col_2||col_4

I want to return a new dataframe that contains just 3 columns, after testing columns 3 and 4:

if col_3 is empty return col_4 else return col_3

To solve it I did this:

>>>
>>> def calculcolumn(col_3, col_4):
...     if (col_3 is None ):
...             return col_4
...     else:
...             return col_3
...
>>>
>>> calculcolumn( ,12)
  File "<stdin>", line 1
    calculcolumn( ,12)
                  ^
SyntaxError: invalid syntax
>>>

But it throws SyntaxError, how can I resolve it?

icou
  • does your code have space before your `def`? Spaces and tabs are important in Python unlike other popular languages. Please make sure what you posted is the same as your code you are trying to run. – MooingRawr Aug 28 '18 at 14:19
  • @MooingRawr Thank you I edited my question. – icou Aug 28 '18 at 14:23
  • What are you expecting `calculcolumn( ,12)` to do? It's a `SyntaxError` because you can't just ignore an argument. Do you mean to pass `calculcolumn(None, 12)`? – FHTMitchell Aug 28 '18 at 14:23
  • @FHTMitchell I have a spark Dataframe – icou Aug 28 '18 at 14:25
  • @FHTMitchell I gave you an example of the structure of the dataframe; sometimes col_3 is empty between two `||` – icou Aug 28 '18 at 14:27
  • `combine_first` ? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.combine_first.html – Chris Adams Aug 28 '18 at 14:28

2 Answers


You are getting a SyntaxError because `calculcolumn( ,12)` is not valid Python: a positional argument cannot simply be left blank.
You must pass the first argument as well.

def calculcolumn(col_3, col_4):
    if col_3 is None:
        return col_4
    else:
        return col_3

calculcolumn(None, 12)

You can also give the parameters default values and call the function with keyword arguments:

def calculcolumn(col_3=None, col_4=None):
    if col_3 is None:
        return col_4
    else:
        return col_3

calculcolumn(col_4=12)
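
For what it's worth, both call styles can be checked in plain Python (no Spark needed); this sketch just reuses the function above with default values:

```python
def calculcolumn(col_3=None, col_4=None):
    # Fall back to col_4 only when col_3 is missing
    if col_3 is None:
        return col_4
    return col_3

print(calculcolumn(None, 12))   # 12 (col_3 missing, col_4 used)
print(calculcolumn(col_4=12))   # 12 (col_3 defaults to None)
print(calculcolumn(7, 12))      # 7  (col_3 present, col_4 ignored)
```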
Maor Refaeli

If you are using a PySpark dataframe you should use native PySpark functions. To solve your problem you can create a new column based on whether `col_3` is null:

import pyspark.sql.functions as func

df = df.withColumn('new_col', func.when(func.col("col_3").isNull(), func.col("col_4")).otherwise(func.col("col_3")))

This creates a new column that takes the value of `col_4` wherever `col_3` is null, and `col_3` otherwise.

vielkind
  • To be clear, you should include in the answer; ```import pyspark.sql.functions as func``` – hwrd Nov 05 '18 at 14:24