
I am trying to deploy a trained Faiss index to PySpark and run a distributed search. The whole process includes:

  1. Pre-process
  2. Load the Faiss index (~15 GB) and run the Faiss search
  3. Post-process and write to HDFS

I set CPUs per task to 10 (spark.task.cpus=10) so the search can run multi-threaded. However, steps 1 and 3 can only use 1 CPU per task, so to use all CPUs I want to set spark.task.cpus=1 before steps 1 and 3. I tried the set method of RuntimeConfig, but it seems to make my program hang. Any advice on how to change the config at runtime, or how else to optimize this?
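
For reference, what I tried looks roughly like this (just a sketch, assuming spark is the active SparkSession):

# sketch of the RuntimeConfig attempt: change spark.task.cpus between stages
spark.conf.set("spark.task.cpus", "1")   # before steps 1 and 3
spark.conf.set("spark.task.cpus", "10")  # before the Faiss search step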

Code example:

import faiss
import numpy as np

def load_and_search(x, model_path):
    # load the ~15 GB index inside the task and search the whole partition at once
    faiss_idx = faiss.read_index(model_path)
    q_vec = np.concatenate(x)
    _, idx_array = faiss_idx.search(q_vec, k=10)
    return idx_array


data = sc.textFile(input_path)

# preprocess, only uses one cpu per task
data = data.map(lambda x: x)

# load faiss index and search, uses multiple cpus per task
data = data.mapPartitions(lambda x: load_and_search(x, model_path))

# postprocess and write, one cpu per task
data = data.map(lambda x: x).saveAsTextFile(result_path)

2 Answers


Alternative idea: use mapPartitions for steps 1 and 3 as well, and inside each task use a multiprocessing pool to map the items of the partition in parallel. This way you can use all the CPUs assigned to a task without changing the configuration (which I do not know is possible at all).

Pseudocode:

import multiprocessing as mp

def item_mapper(item):
    return ...  # your per-item preprocessing / postprocessing logic

def partition_mapper(partition):
    # run up to 10 worker processes inside this single task
    with mp.Pool(processes=10) as pool:
        yield from pool.imap(item_mapper, partition)

rdd.mapPartitions(partition_mapper)
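
Applied to your pipeline, it could look roughly like this (just a sketch; preprocess_item and postprocess_item stand in for your real per-record logic):

import multiprocessing as mp

def preprocess_item(record):
    return record  # hypothetical per-record preprocessing

def postprocess_item(record):
    return str(record)  # hypothetical per-record postprocessing

def preprocess_partition(partition):
    # step 1: fan the partition out to a local process pool
    with mp.Pool(processes=10) as pool:
        yield from pool.imap(preprocess_item, partition)

def postprocess_partition(partition):
    # step 3: same pattern for postprocessing
    with mp.Pool(processes=10) as pool:
        yield from pool.imap(postprocess_item, partition)

data = sc.textFile(input_path)
data = data.mapPartitions(preprocess_partition)                         # step 1
data = data.mapPartitions(lambda x: load_and_search(x, model_path))     # step 2
data.mapPartitions(postprocess_partition).saveAsTextFile(result_path)   # step 3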

Well, you can change the SparkContext properties in the following way:

conf = sc._conf.setAll([('spark.task.cpus','1')])
sc._conf.getAll()
data = data.map(lambda x: x)

conf = sc._conf.setAll([('spark.task.cpus','10')])
sc._conf.getAll()
# load faiss index and search, used multiple cpus per task
data = data.mapPartitions(lambda x: load_and_search(x, model_path))

conf = sc._conf.setAll([('spark.task.cpus','1')])
sc._conf.getAll()
# postprocess and write, one cpu per task
data = data.map(lambda x: x).saveAsTextFile(result_path)

getAll() can be removed; it is only there to check the current configuration.
