
I have a very basic application that downloads data from a server via a socket connection one record at a time and writes each record to a postgresql database. This is only used by me.

import psycopg2
import DataDownloader

downloader = DataDownloader.Downloader()
con = psycopg2.connect(database='db', user='username', password='pswrd')

downloader.get_data(con)
con.close()

In the above code I pass the psycopg2 connection to an instance of a class which then does all the downloading and database updates.

The issue is that this process is extremely slow since within get_data() the class only downloads one record at a time. I have started to change this to incorporate threading:

import threading
import DataDownloader

downloader = DataDownloader.Downloader()

records_to_download = 'abc'
thread1 = threading.Thread(target=downloader.get_data, args=(records_to_download,))
thread1.start()

records_to_download = 'xyz'
thread2 = threading.Thread(target=downloader.get_data, args=(records_to_download,))
thread2.start()

records_to_download = 'ghj'
thread3 = threading.Thread(target=downloader.get_data, args=(records_to_download,))
thread3.start()

All I've done is once again create a single downloader instance, then create three threads, each of which runs get_data with a parameter specifying which records to download.
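The same idea can be sketched with the threads collected in a list and joined, so the main program waits for every download to finish before it exits (get_data and the record groups come from the snippet above; the helper function name is made up):

```python
import threading

def run_downloads(downloader, record_groups):
    # One thread per group of records, as in the snippet above.
    threads = [
        threading.Thread(target=downloader.get_data, args=(group,))
        for group in record_groups
    ]
    for t in threads:
        t.start()
    # join() blocks until each thread has finished, so connections
    # are not left dangling when the main program exits.
    for t in threads:
        t.join()

# run_downloads(downloader, ['abc', 'xyz', 'ghj'])
```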

Instead of creating a psycopg2 connection and passing it to get_data, I have created the connection within get_data so that each thread will have its own connection.

My question is, is that sufficient and correct to ensure all the threads will 'play nice' with the database simultaneously? Each thread is still using the same username/password to access the database. I have ensured that each thread will be updating different records by passing it a records_to_download parameter.

I have heard about connection pooling, but is that overkill for a one-user, multi-thread setup such as this? (I have never used connection pooling and my database knowledge is very basic, so if this is not necessary then I would rather avoid it.)

Lastly, is there a limit to how many threads I can use for this kind of setup before being forced to use a more robust approach such as connection pooling? Or is connection pooling required right from the get go?

darkpool

1 Answer


is that sufficient and correct to ensure all the threads will 'play nice' with the database simultaneously?

Indeed, this is the most straightforward way to achieve parallelism.

(Note that PostgreSQL does in fact support parallelism through a single connection, but to my knowledge it's only readily available in the C driver. There's also executemany and the trick of inserting multiple rows with a single INSERT statement (cf. this question)).

Now that you have efficient parallelism through multiple connections, you should check for server-side bottlenecks. In particular, COMMITting each transaction to disk with the full durability that ACID requires may keep the updates slow, as each commit waits for the operating system and storage devices to report back the successful arrival of the payload.

To that end you might want to relax the durability a bit by issuing SET synchronous_commit TO off in every connection that you use for the bulk updates. (Cf. synchronous_commit and this question).
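That relaxation is a single statement per connection; a tiny helper might look like this (the only PostgreSQL-specific part is the SET statement itself):

```python
RELAX_DURABILITY_SQL = "SET synchronous_commit TO off"

def relax_durability(con):
    # With synchronous_commit off, COMMIT returns before the WAL
    # record reaches disk: a crash can lose the last few transactions,
    # but a re-runnable bulk load can simply be repeated.
    with con.cursor() as cur:
        cur.execute(RELAX_DURABILITY_SQL)
```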

Alternatively you might group the updates in large transactions, delaying the durability-to-disk issues until the transaction commit.
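A sketch of that grouping, with one COMMIT per batch rather than per record (the batch size and table are illustrative; con is whatever psycopg2.connect returned):

```python
def load_in_batches(con, records, batch_size=1000):
    cur = con.cursor()
    for i, record in enumerate(records, start=1):
        # Hypothetical table and columns, for illustration.
        cur.execute(
            "INSERT INTO records (id, payload) VALUES (%s, %s)",
            record)
        if i % batch_size == 0:
            con.commit()  # one durable flush per batch, not per row
    con.commit()          # flush the final partial batch
```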

I have heard about connection pooling but is that overkill for a one user, multi-thread setup such as this?

Connection pooling is used primarily to alleviate the cost of establishing and tearing down connections. In some setups it can also decrease the number of connections the database needs to maintain.

Consider the following use case: 1) User visits a server page; 2) Server establishes a database connection; 3) Server checks if the user exists by querying the database; 4) Server tears down the connection; 5) Server sends the page to the user.

In such a repetitive task the second and fourth steps are usually optimized with some kind of connection pooling.

Another use case: 1) User visits a server page; 2) A server process establishes and caches a database connection; 3) Server checks if the user exists by querying the database; 4) Server sends the page to the user.

Here every server process keeps its own database connection in order not to establish a new one every time there is a request. So if you have, say, 200 PHP processes running then there are 200 database connections. This takes RAM from the database server to maintain the connections. Connection pooling might be used instead, to achieve a similar performance without the need to maintain 200 connections.

In your scenario I see no need for connection pooling. Also having your own connections allows you to tune them further with synchronous_commit, work_mem etc, which is not advisable when using a connection pool because by leaving the tweaked connections in the pool you might affect the durability and performance of other pool users in an uncontrolled way.

Lastly, is there a limit to how many threads I can use for this kind of setup before being forced to use a more robust approach such as connection pooling? Or is connection pooling required right from the get go?

Multithreading is not a silver bullet. You need only as many threads and connections as it takes to counter the bottleneck imposed by the serial nature of the driver's communication with the database (indeed, if you were to use the asynchronous, pipelined version of the C driver then you'd likely need no threading at all) and to reach the performance ceiling imposed by the database server configuration and hardware. Throwing more threads at the problem than needed will only thrash the system, making it work even slower (cf. Reasons for Performance Reduction Past the "Knee").

Test and measure to come up with the optimal number of threads and connections for your use case.

Performance is your limit: throw too many threads at the problem and performance will suffer.

As for connection pooling, you only need it in specific use cases.

ArtemGr