I have a dataframe with thousands of columns that I would like to pass to the greatest function without specifying the column names individually. How can I do that?
As an example, I have df with 3 columns, which I pass to greatest by specifying each one as df.x, df.y, and so on.
>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|       4|
+--------+
In the above example I had only 3 columns, but with thousands of columns it is impractical to list each one. A couple of things I tried didn't work. I am missing some crucial Python...
df.select(greatest(",".join(df.columns)).alias('greatest')).show()
ValueError: greatest should take at least two columns
df.select(greatest(",".join(df.columns),df[0]).alias('greatest')).show()
u"cannot resolve 'x,y,z' given input columns: [x, y, z];"
df.select(greatest([c for c in df.columns],df[0]).alias('greatest')).show()
Method col([class java.util.ArrayList]) does not exist
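One direction that may be worth checking is Python's argument unpacking. The sketch below is only an illustration and does not need Spark: greatest_stub is a hypothetical stand-in that copies the variadic shape of pyspark.sql.functions.greatest(*cols) and its two-column check, and the columns list stands in for df.columns. It shows why the joined string is seen as a single argument, while the * operator passes each name separately.

```python
# Hypothetical stand-in with the same variadic shape as
# pyspark.sql.functions.greatest(*cols). It only records what it receives
# and does not touch Spark.
def greatest_stub(*cols):
    if len(cols) < 2:
        raise ValueError("greatest should take at least two columns")
    return list(cols)

columns = ['x', 'y', 'z']  # assumed to match df.columns in the example above

# A joined string is one argument ('x,y,z'), so the two-column check fails.
try:
    greatest_stub(",".join(columns))
    joined_error = None
except ValueError as exc:
    joined_error = str(exc)

# The * operator unpacks the list into separate positional arguments.
unpacked = greatest_stub(*columns)

print(joined_error)
print(unpacked)

# With PySpark available, the corresponding call would presumably be:
#   from pyspark.sql.functions import greatest
#   df.select(greatest(*df.columns).alias('greatest')).show()
```

If this carries over to the real function, the same unpacking should work with a list of Column objects, for example greatest(*[df[c] for c in df.columns]).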