0

I would like to create a parallel process which gets Data from a Database.

I am using a JDBC connector which works quite fine if i run my program not in parallel:

conn = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP",
                          ["LIB_NAME", "PWD"],
                          "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", )
curs = conn.cursor()

However I want to fasten that process and so I am using:

from joblib import Parallel, delayed, parallel_backend
with parallel_backend('multiprocessing'):
    test = Parallel(n_jobs=2, verbose=50)(delayed(function_Name)(df, i) for i in range(0, df.shape[0]))

I get no error message, but it gets just to the point where it creates the first Connection and then gets stuck. Is it not possible to have several JDBC Connections?

Update

I found this website where someone wants to use pooling within python. The updated code for that is now:

c = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP",
                          ["LIB_NAME", "PWD"],
                          "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", )
mypool = pool.QueuePool(c, max_overflow=10, pool_size=5)
# get a connection
conn = mypool.connect() ###here is runs into the error

# use it
curs = conn.cursor()
#curs = conn.cursor()

However now it runs into in error at conn = mypool.connect():

TypeError: 'Connection' object is not callable

Do I have to change somthing in the java files terajdbc4.jar or tdgssconfig.jar to configure pooling? Or did i miss a configuration?

Update2 - Nothing is impossible

So, i looked for a package where I can multiprocess small Database accesses and came up with the following code with the help of this page-->

import concurrent.futures
import JDBC_Connector
import pandas as pd
import time

def make_image_thumbnail(filename):
    print('start')
    conn, curs = JDBC_Connector.JDBC_Connector()
    curs.execute("select DISTINCT id, date "
                 "from Lib.Data "
                 "where id = 1 "                     
                 "and date = '2018-03-0"+str(filename)+"' "                 
                 "order by date;")

    df = pd.DataFrame(curs.fetchall())
    if df.shape[0] > 0:
        df.columns = [curs.description[i][0] for i in range(len(curs.description))]
    if df.shape[0]> 0:
        print('end')
        curs.close()
        return df.DATE[0]

    else:
        curs.close()
        print('end')
        return 0


start_time = time.time()
# your code

#test single thread
for item in list(range(1,10)):
    make_image_thumbnail(item)
elapsed_time = time.time() - start_time
print('not parallel')
print(elapsed_time)



start_time = time.time()
# Create a pool of processes. By default, one is created for each CPU in your machine.
with concurrent.futures.ProcessPoolExecutor(max_workers = 5) as executor:
    # Get a list of files to process
    image_files = list(range(1,10))

    # Process the list of files, but split the work across the process pool to use all CPUs!
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print("A thumbnail for {} was saved as {}".format(image_file, thumbnail_file))

elapsed_time = time.time() - start_time

Not Parallel Process = 145.4

Parallel Process = 45.3

Not quite 5 times faster, however that depends on the size of each query. However the strange thing is: When I execute the full script, so it executes first the iterative one and then the parallel one then the parallel one gets stuck and wont go any further. Why is that so?

Mimi Müller
  • 416
  • 8
  • 25
  • Have you read any thing about JDBC Connection pools – Sunimal S.K Malkakulage Mar 21 '18 at 09:18
  • No, i am quite new to this topic. So it is not possible to do it? – Mimi Müller Mar 21 '18 at 09:20
  • Do I just have to change then something in python with a new package or do I have to also change the java driver? – Mimi Müller Mar 21 '18 at 13:19
  • In the end, will it be a standalone app or an app running in any container ( like Tomcat, Weblogic etc ) ? – Sabir Khan Mar 22 '18 at 16:10
  • [This](https://stackoverflow.com/questions/98687/what-is-the-best-solution-for-database-connection-pooling-in-python) might be helpful. – Sabir Khan Mar 22 '18 at 16:17
  • As far as I'm aware, normal Python is still effectively single-threaded because of the Global Interpreter Lock. Doing what you want to do might simply not be possible. – Mark Rotteveel Mar 22 '18 at 19:55
  • @SabirKhan not an app or container. It is just a scrip which will run daily on a server. – Mimi Müller Mar 26 '18 at 07:57
  • How about creating several scripts and let them run in parallel? So I start multiple scripts from python and wait till they are finished. Through that each python script would have its own session? – Mimi Müller Apr 10 '18 at 06:36
  • @MimiMüller Did you try with multi threading ? With IO operation GIL should not be an issue. I was thinking what if we create a connection pool before hand, i will test the **algorithm**: *Create a collection pool via SQLAlchemy.* | *Design function: `make_image_thumbnail(connection, filename)`* | *Use ThreadPool map.* | **Reason:** *With thread pool, lower foot print* | *When we use connection pool, we don't create connection everytime.* – Arnab Mukherjee Jul 01 '21 at 13:43

0 Answers0