I want to calculate the minimum of each row in a PySpark DataFrame.

In NumPy, this can be written as:

df.min(axis=1)

but I don't know how to do the same thing with a PySpark DataFrame.

e.g. I create a DataFrame (my real data is approximately 1,000,000 rows × 1,000 columns):

df = sqlContext.createDataFrame([(10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4)], ("c1", "c2", "c3"))


+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+

and I want the output below:

+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+
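
For illustration, a minimal sketch of the kind of expression I am looking for, assuming pyspark.sql.functions.least (which, as far as I know, returns the per-row minimum of the columns passed to it):

from pyspark.sql import functions as F

# least() takes two or more columns and returns the per-row minimum;
# unpacking df.columns keeps this workable for ~1,000 columns
df.withColumn("min", F.least(*[F.col(c) for c in df.columns])).show()

Is this the right approach, and will it scale to data of this size?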