I am trying to read the PGN files from the lichess database: https://database.lichess.org/. The 2013-01 file is 16.1 MB and loads in around 8 seconds, but the 2014-07 file is 176 MB and still hasn't finished after 16 minutes. This is concerning, as for my final output I really need to use the most recent file, which is 27.3 GB.
from time import perf_counter
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def parse_game_file(game_file):
    load_start = perf_counter()
    # wholeTextFiles returns each file as a single (path, contents) record
    basefile = spark.sparkContext.wholeTextFiles(game_file, 10).toDF()
    load_stop = perf_counter()
    print("Time to load file:", round(load_stop - load_start, 2))
    return basefile

df = parse_game_file('lichess_db_standard_rated_2014-07.pgn')
It hangs on the line basefile = spark.sparkContext.wholeTextFiles(game_file, 10).toDF().
I am running this on Google Colab. I do have access to Google Cloud Platform, which I assume would be faster, but I am surprised that Colab can't handle a file that is only 176 MB.
Thanks.