
I've been searching for a long time but can't find any implementation of music feature extraction techniques (like spectral centroid, spectral bandwidth, etc.) integrated with Apache Spark. I work with these feature extraction techniques, and the process takes a lot of time for music. I want to parallelize and accelerate this process using Spark. I have done some work but couldn't get any speedup. I want to get the arithmetic mean and standard deviation of the spectral centroid. This is what I've done so far.

from pyspark import SparkContext
import librosa
import numpy as np
import time

parts=4
print("Parts: ", parts)
sc = SparkContext('local['+str(parts)+']', 'pyspark tutorial')

def spectral(iterator):
    # Each partition hands over an iterator of raw samples; compute the spectral
    # centroid of that chunk and return its mean and standard deviation.
    l = list(iterator)
    cent = librosa.feature.spectral_centroid(np.array(l), hop_length=256)
    ort = np.average(cent)
    std = np.std(cent)
    return (ort, std)

y, sr = librosa.load("classical.00080.au")  # Load the song as a 1-D signal y with sampling rate sr.

start1 = time.time()
normal = librosa.feature.spectral_centroid(np.array(y), hop_length=256)  # Baseline: plain librosa, no Spark
end1 = time.time()

print("\nOrt: \t", np.average(normal))
print("Std: \t", np.std(normal))
print("Time elapsed: %.5f" % (end1 - start1))

# This is where my Spark implementation starts.
rdd = sc.parallelize(y)  # with local[parts], this defaults to `parts` partitions
start2 = time.time()
result = rdd.mapPartitions(spectral).collect()  # flat list: [mean_0, std_0, mean_1, std_1, ...]
end2 = time.time()
result = np.array(result)

# Simple average of the per-partition means and standard deviations.
total_avg, total_std = 0, 0
for i in range(0, parts*2, 2):
    total_avg += result[i]
    total_std += result[i+1]
spark_avg = total_avg / parts
spark_std = total_std / parts

print("\nOrt:", spark_avg)
print("Std:", spark_std)
print("Time elapsed: %.5f" % (end2 - start2))

The output of the program is below.

Ort:     971.8843380584146
Std:     306.75410601230413
Time elapsed: 0.17665

Ort:     971.3152955225721
Std:     207.6510740703993
Time elapsed: 4.58174

So even though I parallelized the array y (the array of music samples), I can't speed up the process; it actually takes longer, and I don't understand why. I am a newbie with Spark. I also thought about using a GPU for this, but couldn't implement that either. Can anyone help me understand what I am doing wrong?

  • [Spark: Inconsistent performance number in scaling number of cores](https://stackoverflow.com/q/41090127/8371915) – Alper t. Turker May 27 '18 at 08:26
  • Spark is *not* about parallelization of processes in a single machine; if your data can indeed fit into the memory of your single machine, involving Spark will most probably make the implementation *slower*. Spark is about big data sets that cannot fit into the main memory of a single machine, thus parallelizing the process across a *cluster* of many machines... – desertnaut May 29 '18 at 16:43
  • Thank you for your answers. – Hilmi Bilal Çam May 31 '18 at 00:36
  • See this [Databricks blog post](https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html) for a use case involving single node computation. – michen00 Oct 04 '21 at 01:11
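
Following up on desertnaut's comment about single machines: if the aim is only to use several cores on one machine, a plain process pool avoids Spark's scheduling and serialization overhead entirely, and the gain is usually clearer when parallelizing over many songs rather than over slices of one signal. A minimal sketch, assuming the same librosa setup as above (the file list is a placeholder):

from multiprocessing import Pool
import numpy as np
import librosa

def centroid_stats(path):
    # Load one file and return the mean and std of its spectral centroid.
    y, sr = librosa.load(path)
    cent = librosa.feature.spectral_centroid(y, hop_length=256)
    return path, float(np.average(cent)), float(np.std(cent))

if __name__ == "__main__":
    files = ["classical.00080.au", "classical.00081.au"]  # placeholder file names
    with Pool(processes=4) as pool:
        for path, mean, std in pool.map(centroid_stats, files):
            print(path, mean, std)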
