I have the following data in a DataFrame:
col1 col2 col3 col4
1 desc1 v1 v3
2 desc2 v4 v2
1 desc1 v4 v2
2 desc2 v1 v3
I need only the first row for each unique combination of col1 and col2, as shown below.
Expected Output:
col1 col2 col3 col4
1 desc1 v1 v3
2 desc2 v4 v2
How can I achieve this in PySpark (version 1.3.1)?
I already got this result by converting the DataFrame to an RDD, applying map and reduceByKey, and then converting the resulting RDD back into a DataFrame. Is there a way to do the same thing using only DataFrame functions?
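For reference, the RDD-based workaround looks roughly like the sketch below. It is a minimal illustration, not my exact code: the helper names (key_of, keep_first, first_row_per_key) are made up for this post, and it assumes df is the DataFrame shown above and sqlContext is an existing SQLContext.

```python
def key_of(row):
    # Composite key: the (col1, col2) pair that must be unique in the output.
    # Assumes col1 and col2 are the first two columns of the row.
    return (row[0], row[1])


def keep_first(a, b):
    # Reducer: keep the row that reaches the reducer first, drop the other.
    return a


def first_row_per_key(df, sqlContext):
    # df: input DataFrame, sqlContext: an existing SQLContext (both assumed).
    deduped = (df.rdd
                 .map(lambda row: (key_of(row), row))  # (key, row) pairs
                 .reduceByKey(keep_first)              # one row per key
                 .values())                            # drop the key again
    # Rebuild a DataFrame with the original schema.
    return sqlContext.createDataFrame(deduped, df.schema)
```

It would be called as result = first_row_per_key(df, sqlContext). One caveat with this sketch: reduceByKey does not promise any particular ordering, so "first" here means whichever row for a key reaches the reducer first, which is not necessarily the first row in the original DataFrame.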