Run DBAPI call asynchronously in Pyspark application

Question

I have an application that creates a few dataframes, writes them to disk, then runs a command using vertica_python to load the data into Vertica. The Spark Vertica connector doesn't work because of an encrypted drive.

What I'd like is for the application to issue the load command and then move on to the next job immediately. Instead, it waits for the load to finish in Vertica before moving on to the next job. How can I make it do what I want? Thanks.

What's odd about this problem is that the command I'd like to run in the background is as simple as db_client.cursor.execute(command). I wouldn't expect this to block under normal circumstances, so why does it in Spark?

To be specific: I read in a dataframe, apply transformations, and write the result to s3. I'd then like to start the db load of those files from s3 and, while it runs, take the transformed dataframe, transform it again, write to s3, load to db, and so on, multiple times.

Possible duplicate of [How to run 2 functions doing completely independent transformations on a single RDD in parallel using pyspark?](https://stackoverflow.com/questions/38048068/how-to-run-2-functions-doing-completely-independent-transformations-on-a-single) — Alper t. Turker, May 22 '18 at 07:08

score 0 · Answer 1 · answered May 22 '18 at 23:16

I see now what I was doing wrong. Simply putting the DBAPI call in its own thread isn't enough; the other calls that I want to run concurrently have to go in their own threads as well.
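A minimal sketch of that idea, not the exact code from my application: the function names, the example bucket path, and the time.sleep calls are hypothetical stand-ins for the Spark transform/write and for the blocking cursor.execute(command) call. It also assumes each load can run independently of the others (for instance on its own connection).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transform_and_write(step, log):
    # Hypothetical stand-in for: transform the dataframe, write it to s3.
    time.sleep(0.05)
    path = f"s3://example-bucket/step_{step}"
    log.append(("written", path))
    return path


def load_into_db(path, log):
    # Hypothetical stand-in for the blocking DBAPI call,
    # e.g. cursor.execute(command) on a connection owned by this thread.
    time.sleep(0.2)
    log.append(("loaded", path))


def run_pipeline(steps=3):
    log = []    # list.append is safe to call from several threads in CPython
    loads = []  # futures for the background loads
    with ThreadPoolExecutor(max_workers=4) as pool:
        for step in range(steps):
            # The Spark-side work runs in a worker thread; we wait for it
            # because the next stage depends on its output.
            path = pool.submit(transform_and_write, step, log).result()
            # The load is submitted but not awaited, so the loop continues
            # with the next stage while the load is still running.
            loads.append(pool.submit(load_into_db, path, log))
        # Only at the very end do we wait for every load; result() also
        # re-raises any exception raised inside a load.
        for future in loads:
            future.result()
    return log


if __name__ == "__main__":
    for event, path in run_pipeline():
        print(event, path)
```

Calling result() on the load futures only after the loop is what keeps the loads in the background; calling it right after each submit would make the pipeline sequential again.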

BossColo
