I have some experience with PySpark. Our team is migrating a Spark project from Python to C# (.NET for Apache Spark), and I've run into a problem.

Suppose we have a Spark DataFrame df with an existing column col1.

In pyspark, I could do something like:

from pyspark.sql.functions import when, lit

df = df.withColumn('new_col_name', when((df.col1 <= 5), lit('Group A')) \
                    .when((df.col1 > 5) & (df.col1 <= 8), lit('Group B')) \
                    .when((df.col1 > 8), lit('Group C')))

How do I do the equivalent in C#?

I've tried many things but still get exceptions when using the When() method. For example, the following code throws:

df = df.WithColumn("new_col_name", df.Col("col1").When(df.Col("col1").EqualTo(3), Functions.Lit("Group A")));

Exception:

[MD2V4P4C] [Error] [JvmBridge] java.lang.IllegalArgumentException: when() can only be applied on a Column previously generated by when() function

I searched around and didn't find many examples for .NET for Apache Spark. Any help would be much appreciated.

MiffyW

1 Answer

I think the problem is that the first call has to be the static When function (Functions.When), which is not a member of the Column object. Every call after that, within the same column expression, uses the version of When that is a member function. So:

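using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions; // brings When, Col and Lit into scope

var spark = SparkSession.Builder().GetOrCreate();
var df = spark.Range(100);

// The first When is the static Functions.When; the chained ones are Column.When.
df.WithColumn("new_col_name",
    When(Col("Id").Lt(5), Lit("Group A"))
        .When(Col("Id").Between(5, 8), Lit("Group B"))
        .When(Col("Id").Gt(8), Lit("Group C"))
).Show();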

You can also use operators such as >, ==, <, etc. like this, but personally I prefer the more explicit version above:

df.WithColumn("new_col_name",
    When(Col("Id") < 5, Lit("Group A"))
        .When(Col("Id") >= 5 & Col("Id") <= 8, Lit("Group B"))
        .When(Col("Id") > 8, Lit("Group C"))
).Show();
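
To make the reason for the exception a bit more concrete, here is a rough toy model in plain Python. It is not PySpark or Microsoft.Spark code: the Column class, the when function and the evaluate method below are hypothetical stand-ins I made up for illustration, and only the error message text is taken from the question. The idea, as far as I understand it, is that the chained when is only valid on a column that already carries a list of branches, which only the standalone when creates.

```python
class Column:
    """Toy stand-in for a Spark Column (hypothetical, not the real class)."""

    def __init__(self, name, branches=None):
        self.name = name
        # None for a plain column, a list of (condition, value) for a when() chain.
        self.branches = branches

    def when(self, condition, value):
        # A plain column has no branches, so chaining onto it is rejected.
        if self.branches is None:
            raise ValueError(
                "when() can only be applied on a Column "
                "previously generated by when() function"
            )
        return Column(self.name, self.branches + [(condition, value)])

    def evaluate(self, row):
        if self.branches is None:
            return row[self.name]
        # The first matching branch wins; no match gives None (like SQL NULL).
        for condition, value in self.branches:
            if condition(row):
                return value
        return None


def when(condition, value):
    """Standalone function that starts a chain (plays the role of Functions.When)."""
    return Column("case_when", [(condition, value)])


# Works: the chain starts with the standalone when().
group = (
    when(lambda r: r["col1"] <= 5, "Group A")
    .when(lambda r: 5 < r["col1"] <= 8, "Group B")
    .when(lambda r: r["col1"] > 8, "Group C")
)
print([group.evaluate({"col1": v}) for v in (3, 7, 9)])
# ['Group A', 'Group B', 'Group C']

# Fails: the chain starts on a plain column, as in the question's C# line.
try:
    Column("col1").when(lambda r: r["col1"] == 3, "Group A")
except ValueError as err:
    print(err)
```

In this sketch, df.Col("col1").When(...) corresponds to the failing case at the bottom, and When(...).When(...) corresponds to the working chain.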
Ed Elliott
  • Thank you @Ed Elliott for the reply. :) Just I haven't got it working yet even following that. It seems that the syntax is forcing me to have a 'Col()' before calling the When() method. – MiffyW Apr 11 '22 at 05:24