1

Total Result of all rounds of Tournament for that player is considered as that player's Score/Result.

Schema:

 |-- game_id: string (nullable = true)
 |-- game_order: integer (nullable = true)
 |-- event: string (nullable = true)
 |-- site: string (nullable = true)
 |-- date_played: string (nullable = true)
 |-- round: double (nullable = true)
 |-- white: string (nullable = true)
 |-- black: string (nullable = true)
 |-- result: string (nullable = true)
 |-- white_elo: integer (nullable = true)
 |-- black_elo: integer (nullable = true)
 |-- white_title: string (nullable = true)
 |-- black_title: string (nullable = true)
 |-- winner: string (nullable = true)
 |-- winner_elo: integer (nullable = true)
 |-- loser: string (nullable = true)
 |-- loser_elo: integer (nullable = true)
 |-- winner_loser_elo_diff: integer (nullable = true)
 |-- eco: string (nullable = true)
 |-- date_created: string (nullable = true)
 |-- tournament_name: string (nullable = true)

Sample DaraFrame:

+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
|             game_id|game_order|   event|      site|date_played|round|           white|           black| result|white_elo|black_elo|white_title|black_title|   winner|winner_elo|           loser|loser_elo|winner_loser_elo_diff|eco|        date_created|tournament_name|
+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
|86e0b7f5-7b94-4ae...|         1|WCh 2021| Dubai UAE| 2021.11.26|  1.0|Nepomniachtchi,I|       Carlsen,M|1/2-1/2|     2782|     2855|       null|       null|     draw|      null|            draw|     null|                    0|C88|2022-07-22T22:33:...| WorldChamp2021|
|dc4a10ab-54cf-49d...|         2|WCh 2021| Dubai UAE| 2021.11.27|  2.0|       Carlsen,M|Nepomniachtchi,I|1/2-1/2|     2855|     2782|       null|       null|     draw|      null|            draw|     null|                    0|E06|2022-07-22T22:33:...| WorldChamp2021|
|f042ca37-8899-488...|         3|WCh 2021| Dubai UAE| 2021.11.28|  3.0|Nepomniachtchi,I|       Carlsen,M|1/2-1/2|     2782|     2855|       null|       null|     draw|      null|            draw|     null|                    0|C88|2022-07-22T22:33:...| WorldChamp2021|
|f70e4bbc-21e3-46f...|         4|WCh 2021| Dubai UAE| 2021.11.30|  4.0|       Carlsen,M|Nepomniachtchi,I|1/2-1/2|     2855|     2782|       null|       null|     draw|      null|            draw|     null|                    0|C42|2022-07-22T22:33:...| WorldChamp2021|
|c941c323-308a-4c8...|         5|WCh 2021| Dubai UAE| 2021.12.01|  5.0|Nepomniachtchi,I|       Carlsen,M|1/2-1/2|     2782|     2855|       null|       null|     draw|      null|            draw|     null|                    0|C88|2022-07-22T22:33:...| WorldChamp2021|
|58e83255-93bb-4d5...|         6|WCh 2021| Dubai UAE| 2021.12.03|  6.0|       Carlsen,M|Nepomniachtchi,I|    1-0|     2855|     2782|       null|       null|Carlsen,M|      2855|Nepomniachtchi,I|     2782|                   73|D02|2022-07-22T22:33:...| WorldChamp2021|
|29181d93-73f4-4fb...|         7|WCh 2021| Dubai UAE| 2021.12.04|  7.0|Nepomniachtchi,I|       Carlsen,M|1/2-1/2|     2782|     2855|       null|       null|     draw|      null|            draw|     null|                    0|C88|2022-07-22T22:33:...| WorldChamp2021|
|8a4ccd8c-d437-429...|         8|WCh 2021| Dubai UAE| 2021.12.05|  8.0|       Carlsen,M|Nepomniachtchi,I|    1-0|     2855|     2782|       null|       null|Carlsen,M|      2855|Nepomniachtchi,I|     2782|                   73|C43|2022-07-22T22:33:...| WorldChamp2021|
|55a122db-27d1-495...|         9|WCh 2021| Dubai UAE| 2021.12.07|  9.0|Nepomniachtchi,I|       Carlsen,M|    0-1|     2782|     2855|       null|       null|Carlsen,M|      2855|Nepomniachtchi,I|     2782|                   73|A13|2022-07-22T22:33:...| WorldChamp2021|
|1f900d18-5ea3-4f4...|        10|WCh 2021| Dubai UAE| 2021.12.08| 10.0|       Carlsen,M|Nepomniachtchi,I|1/2-1/2|     2855|     2782|       null|       null|     draw|      null|            draw|     null|                    0|C42|2022-07-22T22:33:...| WorldChamp2021|

My code looks like this. I think it's messed up. Am I supposed to do sum somewhere?

winners = df_history_info.filter(df_history_info['winner'] != "draw").groupBy("tournament_name").agg({"winner":"max"}).show() 

I'm getting this result but it is incorrect in many cases.

+---------------+--------------------+
|tournament_name|         max(winner)|
+---------------+--------------------+
| WorldChamp2004|              Leko,P|
| WorldChamp1894|   Steinitz, William|
| WorldChamp2013|     Carlsen, Magnus|
|  FideChamp2000|       Yermolinsky,A|
| WorldChamp2007|           Svidler,P|
|  FideChamp1993|       Timman, Jan H|
|WorldChamp1910b|     Lasker, Emanuel|
| WorldChamp1921|Capablanca, Jose ...|
| WorldChamp1958|    Smyslov, Vassily|
| WorldChamp1981|  Kortschnoj, Viktor|
| WorldChamp1961|         Tal, Mihail|
| WorldChamp1978|  Kortschnoj, Viktor|
| WorldChamp1960|         Tal, Mihail|
| WorldChamp1948|    Smyslov, Vassily|
| WorldChamp1929|    Bogoljubow, Efim|
| WorldChamp1934|    Bogoljubow, Efim|
| WorldChamp1986|      Kasparov, Gary|
|   PCAChamp1995|      Kasparov, Gary|
| WorldChamp1886|Zukertort, Johann...|
| WorldChamp1907|     Lasker, Emanuel|
+---------------+--------------------+
AutumnRain
  • 25
  • 4
  • can you provide a sample of your pyspark dataframe? it would be helpful to see what some of the column values look like (see [here](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples)). – Derek O Sep 25 '22 at 02:04
  • 1
    @DerekO I have added the sample dataframe. I'll take a look at your answer and see if I can fix it. Thank you! – AutumnRain Sep 25 '22 at 06:15
  • thanks for the update! so the `winner` column contains a string that says either `"draw"` or the name of the winning player. in that case, i think my answer should work as it gets the max count of the winner when grouping by `tournament_name` – Derek O Sep 25 '22 at 06:17
  • 1
    Thanks! It worked. I grouped and then filtered the draws and used the WindowSpec @DerekO – AutumnRain Sep 25 '22 at 07:00
  • glad to hear my answer helped, @AutumnRain ! – Derek O Sep 25 '22 at 15:48

1 Answers1

0

Since the winner column contains either the winning player's name or the word "draw" which you've filtered out, then this means the operation .agg({"winner":"max"}) will return the max of a string. This is why Zukertort, Johann... appears as the winner of WorldChamp1886 instead of Steinitz..., and Yermolinksky,A appears as in the winner in the 128 person field in the FideChamp2000.

Here is an example of something you could try with a spark dataframe that looks like the following:

df = spark.createDataFrame(
    [
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC1", "A"),
        ("WC1", "A"),
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC2", "F"),
        ("WC2", "F"),
        ("WC2", "F"),
        ("WC2", "D"),
        ("WC2", "D"),
        ("WC2", "E"),
        ("WC2", "F"),
        ("WC2", "F"),
    ],
    ["tournament_name", "winner"]  # add your column names here
)

And you have a situation like this where you want to determine who wins each tournament by the most number of times their name appears in the winner column.

+---------------+------+
|tournament_name|winner|
+---------------+------+
|            WC1|     A|
|            WC1|     B|
|            WC1|     A|
|            WC1|     A|
|            WC1|     A|
|            WC1|     B|
|            WC1|     A|
|            WC1|     B|
|            WC2|     F|
|            WC2|     F|
|            WC2|     F|
|            WC2|     D|
|            WC2|     D|
|            WC2|     E|
|            WC2|     F|
|            WC2|     F|
+---------------+------+

You can do a groupby count on tournament_name and winner:

d = df.groupby(["tournament_name","winner"]).count()

And that gives you this pyspark dataframe:

+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
|            WC1|     B|    3|
|            WC1|     A|    5|
|            WC2|     F|    5|
|            WC2|     D|    2|
|            WC2|     E|    1|
+---------------+------+-----+

Then following this example, you could create a WindowSpec object that partitions by tournament_name, and sorts by in descending order of the count column, and apply it to d:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

windowDept = Window.partitionBy("tournament_name").orderBy(col("count").desc())
d.withColumn("row",row_number().over(windowDept)) \
  .filter(col("row") == 1).drop("row") \
  .show()

Final result:

+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
|            WC1|     A|    5|
|            WC2|     F|    5|
+---------------+------+-----+
Derek O
  • 16,770
  • 4
  • 24
  • 43