
I have generated a UserDefinedFunction like this:

def function1(instance):
    if(instance['Atr1'] == '--'):
        return '++'
    else:
        return '++++'

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda instance: function1(instance), StringType())

udf(df)

Where my dataframe has some attributes: 'Atr1', 'Atr2', 'AtrN'...

I get the error:

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

I want to get a column with only that attribute. How could I do it?

jartymcfly

1 Answer


You can call the udf on the column you need:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

# the lambda receives the column's values, not the whole DataFrame
udf = UserDefinedFunction(lambda instance: instance, StringType())
df.select(udf('Atr1')).collect()
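
For the logic in the question, one sketch would be to rewrite function1 so it takes the single column value instead of a whole row (the alias Atr1_mapped is just an illustrative name):

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

# function1 adapted to receive the value of Atr1 rather than a row
def function1(value):
    return '++' if value == '--' else '++++'

udf = UserDefinedFunction(function1, StringType())
df.select(udf('Atr1').alias('Atr1_mapped')).show()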

To create an attribute based on an existing one with a simple function like the above, we don't need a UDF. We can do:

from pyspark.sql import functions as F
df.withColumn('Atr4',F.when(df.Atr1 == '--','++').otherwise('++++')).show()
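
This keeps the transformation as a native Column expression, which avoids the serialization overhead of a Python UDF.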

Or, if the same logic is used to create many attributes, we can move it into a UDF and reuse it:

# F.when only works on Column expressions, so inside a UDF use plain Python logic
udf = UserDefinedFunction(lambda attr: '++' if attr == '--' else '++++', StringType())
df.select('Atr1', 'Atr2', 'Atr3', udf('Atr1').alias('Atr4'), udf('Atr2').alias('Atr5')).show()

and so on.

Suresh
  • And what if I have a function that performs some operations over different attributes of the instance? I mean, if I have a function called "func(instance)" that works on the instance's attributes and I call it with "udf = UserDefinedFunction(lambda instance: func(instance), StringType())". Let the "func(instance)" code be: "if(instance['Atr1'] > 0): return True" – jartymcfly Jul 13 '17 at 12:43
  • I edited the question for your better understanding. – jartymcfly Jul 13 '17 at 13:03
  • So, based on Atr1, do you want to create a new attribute or change Atr1's value? – Suresh Jul 13 '17 at 13:23
  • I want to create a new attribute. – jartymcfly Jul 13 '17 at 14:07
  • Okay, nice answer. I still want to know how I could manage that problem if the method were more complex (many loops and ifs). – jartymcfly Jul 14 '17 at 08:08
  • You can pass the instance's attributes to the method and make it work on them accordingly; see the sketch after this list. Check this: https://stackoverflow.com/questions/45035940/pivot-multiple-columns-pyspark. Might give you some idea. – Suresh Jul 14 '17 at 08:56
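
Following up on the comments, a minimal sketch of passing several attributes to one UDF, assuming the row-level logic needs Atr1 and Atr2 (the func helper and its rule are just illustrative):

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

# hypothetical logic that needs more than one attribute;
# loops and if/else of any complexity can live inside this plain Python function
def func(atr1, atr2):
    if atr1 == '--' and atr2 == '--':
        return '++'
    return '++++'

udf = UserDefinedFunction(func, StringType())

# pass each column the function needs as a separate argument
df.withColumn('Atr4', udf(df.Atr1, df.Atr2)).show()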