I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which I can use to fetch artist info. The Spotify API allows batch calls of up to 50 ids at once. I load the artist ids from a MySQL database and store them in a dataframe.
My problem is that I don't know how to efficiently split that dataframe into batches of 50 or fewer rows.
In the example below I turn the dataframe into a regular Python list and call the API on slices of 50 ids.
Any ideas on how I could do this without going back to a Python list?
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from pyspark.sql import SparkSession
import os

spark = SparkSession \
    .builder \
    .appName("GetArtists") \
    .getOrCreate()

# read the artist ids from MySQL
df = spark.read.format('jdbc') \
    .option("url", "jdbc:mysql://" + os.getenv("DB_SERVER") + ":" + os.getenv("DB_PORT") + "/spotify_metadata") \
    .option("user", os.getenv("DB_USER")) \
    .option("password", os.getenv("DB_PW")) \
    .option("query", "SELECT artist_id FROM artists") \
    .load()

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

# collect to the driver and call the API in batches of 50 ids
ids = [row['artist_id'] for row in df.collect()]
batch_size = 50
for i in range(0, len(ids), batch_size):
    artists = sp.artists(ids[i:i + batch_size])
    # process the JSON response
I thought about using foreach and calling the API for each id, but that results in one request per artist instead of one per 50 ids. The results also get written back to the database, so with that approach I would be writing many single rows instead of batches.
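For reference, this is roughly the foreach variant I was considering (just a sketch; fetch_and_store is a made-up helper name, and each worker has to build its own Spotipy client from the environment credentials):

def fetch_and_store(row):
    # each worker builds its own client from the env credentials
    sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
    artist = sp.artist(row['artist_id'])  # one request per artist id
    # ... write this single artist back to MySQL here, one row at a time

df.foreach(fetch_and_store)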