
I am trying to read in the PGN files from the lichess database: https://database.lichess.org/. The 2013-01 file is 16.1MB and reads in around 8 seconds. The 2014-07 file is 176MB and still hasn't finished after 16 minutes. This is concerning, as I really need to use the most recent file, which is 27.3GB, for my final output.

from time import perf_counter

def parse_game_file(game_file):
    load_start = perf_counter()
    # wholeTextFiles reads each file as a single (path, contents) record
    basefile = spark.sparkContext.wholeTextFiles(game_file, 10).toDF()
    load_stop = perf_counter()
    print("Time to load file:", round(load_stop - load_start, 2))
    return basefile

df = parse_game_file('lichess_db_standard_rated_2014-07.pgn')

It hangs on the line basefile = spark.sparkContext.wholeTextFiles(game_file, 10).toDF()
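For reference, this is roughly how I split that line apart to time the two steps separately (same SparkSession and file as above, nothing else changed):

from time import perf_counter

game_file = 'lichess_db_standard_rated_2014-07.pgn'

start = perf_counter()
rdd = spark.sparkContext.wholeTextFiles(game_file, 10)  # lazy: (path, contents) pairs
print("wholeTextFiles:", round(perf_counter() - start, 2))  # ~0.03s, nothing is read yet

start = perf_counter()
basefile = rdd.toDF()  # this is the step that never comes back on the 2014-07 file
print("toDF:", round(perf_counter() - start, 2))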

I am running this on Google Colab, and I do have access to Google Cloud Platform, which I assume will be faster, but I am surprised that Colab can't deal with a file that is only 176MB.

Thanks.

BlueTurtle
  • Well, there's a lot going on here. The 2014.07.pgn.bz2 file is 176MB compressed, but when decompressed, the PGN file is just over a gigabyte. It took about 15 seconds for me to download the file, and about 15 seconds to do the bunzip2 decompression. Only then can you start to read the games. Are you actually trying to store these multigigabyte files as records in an SQL database? – Tim Roberts Apr 27 '21 at 17:49
  • The compression ratio seems pretty constant, so the most recent 27GB file is likely to result in a 160GB PGN when uncompressed. You certainly can't store that in memory. Do you have room in your cloud instance storage? – Tim Roberts Apr 27 '21 at 17:50
  • Downloading the 2013-01 file took 0.3 seconds. Unzipping the file took 4.3 seconds. wholeTextFiles() took 0.03 seconds. I split the .toDF() line off, and it is that line where it is hanging. Is this because I have not specified the schema of the RDD, so it has to infer it before it can transform it into a dataframe? – BlueTurtle Apr 27 '21 at 18:01
  • Those files cannot be converted to dataframes. Have you looked inside them? The format is nowhere NEAR regular enough. What are you hoping to end up with here? – Tim Roberts Apr 27 '21 at 18:21
  • It can be done using some selectExpr magic that another user helped me with on an old question: https://stackoverflow.com/a/67056741/4367851. I have had it working as a Spark dataframe for a few weeks now and the analysis is working fine; it's only now that I have tried to use the larger input files that it breaks down. – BlueTurtle Apr 27 '21 at 18:24
  • It's too much data. You're going to have to convert these in chunks and accumulate the results in an SQL backend somewhere (a rough sketch of that kind of approach follows these comments). You cannot hope to hold 160GB of data in memory. – Tim Roberts Apr 27 '21 at 18:28
  • Hmm, ok. Maybe I am being naive, but why is it struggling so much with the 176MB/1GB file, though? – BlueTurtle Apr 27 '21 at 18:29
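For what it's worth, here is a minimal sketch of the line-based, chunk-friendly direction suggested in the comments above: read the decompressed PGN with spark.read.text so each line becomes a row (rather than the whole file becoming a single record), pull out the header tags, and write the result to disk instead of holding it in memory. The column names, regexes, and output path are placeholders of mine, not the code from the linked selectExpr answer.

from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark`, as in the question.
# spark.read.text gives one row per line, in a single column named "value".
lines = spark.read.text('lichess_db_standard_rated_2014-07.pgn')

# PGN header lines look like: [Event "Rated Blitz game"]
tags = (lines
        .filter(F.col("value").startswith("["))
        .select(
            F.regexp_extract("value", r'\[(\w+) "', 1).alias("tag"),
            F.regexp_extract("value", r'"(.*)"', 1).alias("tag_value"),
        ))

# Write incrementally to columnar storage rather than keeping ~160GB in memory.
tags.write.mode("overwrite").parquet("pgn_tags_2014_07.parquet")

Regrouping the tags into one row per game would still need something like the approach from the linked answer, but at least the read itself scales line by line instead of loading the whole file as one record.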
