
I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids that can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at a time. I load the artist ids from a MySQL database and store them in a dataframe.

My problem is that I do not know how to efficiently split that dataframe into batches of 50 or fewer rows.

In the example below I'm turning the dataframe into a regular Python list from which I can call the API on batches of 50.

Any ideas how I could do this without going back to a Python list?

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from pyspark.sql import SparkSession
import os

spark = SparkSession\
        .builder\
        .appName("GetArtists")\
        .getOrCreate()

df = spark.read.format('jdbc') \
    .option("url", "jdbc:mysql://"+os.getenv("DB_SERVER")+":"+os.getenv("DB_PORT")+"/spotify_metadata")\
    .option("user", os.getenv("DB_USER"))\
    .option("password", os.getenv("DB_PW"))\
    .option("query", "SELECT artist_id FROM artists")\
    .load()

sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

ids = [row['artist_id'] for row in df.collect()]

batch_size = 50
for i in range(0, len(ids), batch_size):
    artists = sp.artists(ids[i:i + batch_size])

    # process the JSON response

I thought about using foreach and calling the API for each id, but that results in unnecessary requests. The results are also stored back in the database, which would mean writing many single rows to the database.
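One way to batch without collecting every id to the driver first is mapPartitions, which hands each partition's rows to a function that can group them into chunks of 50. This is only a sketch of that idea, not part of the question: the helper names batched and fetch_partition are mine, and it assumes the Spotify credentials are available in the executors' environment.

```python
def batched(items, size):
    """Yield lists of at most `size` items from an iterator."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:  # trailing partial chunk
        yield buf

def fetch_partition(rows):
    # Imported inside the function so each executor builds its own client;
    # SpotifyClientCredentials reads its credentials from environment variables.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())
    for chunk in batched((row['artist_id'] for row in rows), 50):
        yield sp.artists(chunk)  # one API call per chunk of up to 50 ids

# Usage against the df from the question (runs on the executors):
# artist_batches = df.rdd.mapPartitions(fetch_partition).collect()
```

Note that batching happens per partition, so a partition with, say, 70 rows produces one call with 50 ids and one with 20; the total number of requests is still far lower than one per id.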

samiam

1 Answer


If you want to split the dataframe by row number, you can do it like this:

from pyspark.sql import functions as f
from pyspark.sql import Window

df = df.withColumn('row_num', f.row_number().over(Window.orderBy(f.lit(1))))
row_count = df.count()  # avoid shadowing the built-in len

for i in range(0, row_count, 50):
    # & binds tighter than the comparisons, so each condition needs parentheses;
    # row_number() starts at 1, so (i, i+50] selects exactly 50 rows per batch.
    batch = df.filter((f.col('row_num') > i) & (f.col('row_num') <= i + 50))
    # api logic goes here

But if you can pass a dataframe to the API directly, pass each batch as a dataframe; otherwise collect each batch, which will contain at most 50 values at a time.
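For the collect route, the per-batch filter bounds can be computed up front and each batch pulled to the driver just before the API call. This is a sketch: batch_bounds is a helper name of my own, and df, f, and sp refer to the objects from the question and answer above.

```python
def batch_bounds(total, size=50):
    """(low, high] bounds for row_num filters, covering 1..total in batches of `size`."""
    return [(lo, min(lo + size, total)) for lo in range(0, total, size)]

# Driver-side loop (assumes df has the row_num column from the answer):
# for lo, hi in batch_bounds(df.count()):
#     batch = df.filter((f.col('row_num') > lo) & (f.col('row_num') <= hi))
#     ids = [row['artist_id'] for row in batch.collect()]  # at most 50 ids
#     artists = sp.artists(ids)
```

Each collect then moves at most 50 rows, instead of the full table as in the original question.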

Shubham Jain
  • This solution works well. I changed two things in the filter method: – samiam Jun 03 '20 at 23:41
  • 1) The filter statements need to be in parentheses. 2) Removed the equals sign from the second filter statement, so we get the 50 rows. – samiam Jun 03 '20 at 23:49