I have a fairly involved process of creating a pyspark dataframe, converting it to a pandas dataframe, and outputting the result to a flat file. I am not sure at which point the error is introduced, so I'll describe the whole process.
Starting out I have a pyspark dataframe that contains pairwise similarity for sets of ids. It looks like this:
+------+-------+-------------------+
|  ID_A|   ID_B|  EuclideanDistance|
+------+-------+-------------------+
|     1|      1|                0.0|
|     1|      2|0.13103884200454394|
|     1|      3| 0.2176246463836219|
|     1|      4|  0.280568636550471|
...
I'd like to group it by ID_A, sort each group by EuclideanDistance, and keep only the top N pairs for each group. So first I do this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
window = Window.partitionBy(df['ID_A']).orderBy(df['EuclideanDistance'])
result = (df.withColumn('row_num', row_number().over(window)))
I have confirmed that ID_A = 1 is still in the "result" dataframe. Then I do this to limit each group to just 20 rows:
result1 = result.where(result.row_num <= 20)
result1.toPandas().to_csv("mytest.csv")
After this, ID_A = 1 is NOT in the resulting .csv file, although it is still present in result1. Is there a problem somewhere in this chain of conversions that could lead to a loss of data?
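To narrow down which step drops the rows, one option is to compare the set of ID_A values before and after the CSV round trip. The sketch below is pandas-only and uses made-up data: `pdf`, `TOP_N`, and the in-memory buffer are assumptions standing in for result1.toPandas(), the limit of 20, and mytest.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for result1.toPandas(); the values are made up.
pdf = pd.DataFrame({
    "ID_A": [1, 1, 1, 2, 2, 2],
    "ID_B": [1, 2, 3, 1, 2, 3],
    "EuclideanDistance": [0.0, 0.13, 0.22, 0.13, 0.0, 0.31],
})

TOP_N = 2  # hypothetical; the question uses 20

# pandas analogue of row_number() over a window partitioned by ID_A
# and ordered by EuclideanDistance.
pdf = pdf.sort_values(["ID_A", "EuclideanDistance"])
pdf["row_num"] = pdf.groupby("ID_A").cumcount() + 1
top = pdf[pdf["row_num"] <= TOP_N]

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
top.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Any ID present before the write but absent after the read was lost.
ids_before = set(top["ID_A"])
ids_after = set(reloaded["ID_A"])
missing = ids_before - ids_after
print("IDs lost in the CSV round trip:", missing)
```

On the real data, the same comparison could be run with pdf = result1.toPandas() and pd.read_csv("mytest.csv"). If `missing` comes back empty, that would suggest the rows are in the file and the problem lies in how the file is being inspected rather than in the conversion.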