I have a dataframe with thousands of columns that I would like to pass to the greatest function without specifying the column names individually. How can I do that?

As an example, I have a df with 3 columns that I pass to greatest by naming each one explicitly: df.x, df.y, and so on.

>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|       4|
+--------+

In the above example I had only 3 columns, but with thousands of columns it is impractical to list each one. A couple of things I tried didn't work. I am missing some crucial Python...

df.select(greatest(",".join(df.columns)).alias('greatest')).show()
ValueError: greatest should take at least two columns

df.select(greatest(",".join(df.columns),df[0]).alias('greatest')).show()
u"cannot resolve 'x,y,z' given input columns: [x, y, z];"

df.select(greatest([c for c in df.columns],df[0]).alias('greatest')).show()
Method col([class java.util.ArrayList]) does not exist
Alper t. Turker
Bala
  • Use `pandas`. You can use `apply()` with pandas to get the max value from each row (if that is what you are looking for) – Sociopath Feb 08 '18 at 11:06
  • Haven't tried this one but it would make sense: `df.select(greatest(*[col(c) for c in df.columns]).alias('greatest')).show()` – mkaran Feb 08 '18 at 11:07
  • @mkaran - It works. But what does the `*` mean here? – Bala Feb 08 '18 at 11:13
  • The `*` unpacks the list so that `greatest` is called with positional arguments instead of a list. – mkaran Feb 08 '18 at 11:21

1 Answer

greatest supports positional arguments*

pyspark.sql.functions.greatest(*cols)

(this is why you can call greatest(df.x, df.y, df.z)), so just unpack the list of column names:

from pyspark.sql.functions import greatest
df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
df.select(greatest(*df.columns))
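
To see why the attempts in the question fail while the unpacked call works, here is a minimal plain-Python sketch that needs no Spark. `greatest_stub` is a hypothetical stand-in I made up for illustration, not the real `pyspark.sql.functions.greatest`; it only mimics the `*cols` signature and the argument-count check suggested by the error message in the question.

```python
def greatest_stub(*cols):
    """Hypothetical stand-in for greatest(*cols); not the real PySpark function."""
    # Mirrors the error message quoted in the question.
    if len(cols) < 2:
        raise ValueError("greatest should take at least two columns")
    return cols  # return the received arguments so they can be inspected


columns = ["x", "y", "z"]  # what df.columns holds for the example DataFrame

# ",".join(...) builds ONE string, so the function receives a single argument.
joined = ",".join(columns)
try:
    greatest_stub(joined)
    joined_error = None
except ValueError as exc:
    joined_error = str(exc)

# Passing the list itself also counts as one argument (plus "x" here).
as_list = greatest_stub(columns, "x")

# The * spreads the list into three separate positional arguments.
unpacked = greatest_stub(*columns)

print(repr(joined))
print(joined_error)
print(len(as_list), len(unpacked))
```

In the stand-in, the joined string and the list each arrive as a single argument, which lines up with the three errors in the question; only the unpacked call delivers one argument per column.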

* Quoting Python glossary, positional argument is

  • ... an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable preceded by *. For example, 3 and 5 are both positional arguments in the following calls:

    complex(3, 5)
    complex(*(3, 5))
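
    As a rough illustration of the glossary entry, the sketch below checks that both `complex` calls give the same result and shows what a `*cols` parameter receives. `show_args` is a hypothetical helper introduced only for this example.

```python
def show_args(*cols):
    # In a definition, *cols gathers all positional arguments into a tuple.
    return cols


names = ["x", "y", "z"]

# Both calls from the glossary pass 3 and 5 as positional arguments.
same_number = complex(3, 5) == complex(*(3, 5))

# At the call site, * spreads the list: equivalent to show_args("x", "y", "z").
spread = show_args(*names)

# Without *, the whole list is passed as one positional argument.
not_spread = show_args(names)

print(same_number)
print(spread)
print(not_spread)
```

    So `*df.columns` at a call site does not produce a list or a comma-separated string; it passes each element as its own positional argument, and a function declared with `*cols` sees them as a tuple.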
    

Furthermore:

Alper t. Turker
  • `*cols` or `*df.columns` - Does it return a list or comma separated columns as expected by `greatest`? I am always getting confused with it. – Bala Feb 08 '18 at 11:12