Postgres in Python

Question

I have a block of code that I currently don't understand, and I'm not sure if it's running as intended. It'd be appreciated if someone could assist with some information on how to refactor this code.

conn = psycopg2.connect(conn_string) cur = conn.cursor()
query = """ 
        SELECT 
          users.id, 
          users.company_id, 
          users.state, 
          users.phase, 
          transactions.amount 
        FROM 
          cases 
          INNER JOIN transactions ON user.id = transactions.user_id
        WHERE users.company_id IN (9, 15) 
        AND transactions.date < '2018-01-01' 
"""

fields = ['id','company_id', 'state', 'phase','amount']

cur.execute(query) data = cur.fetchall()
internal = []
for d in data: 
    internal += [ [''] * 4 ] 
    internal[-1][0] = d[1] 
    internal[-1][1] = d[fields.index('phase')] == 'activated' 
    internal[-1][2] = d[fields.index('state')] 
    internal[-1][3] = 'success' if d[4] > 0 else 'fail'

print(internal)

As much I would like to help you getting familiar with Python and being helpful in general, this question leaves a lot to imagine. How many records in the DB are you reading, what is your performance expectation, have you made an effort to understand it, have you worked with Python before, etc.. Especially the last lines make me cringe a little and could be a source of the issue. Depending on how you run this the print(internal) may actually be holding you back, e.g. Sublime Text console gets really slow at printing a lot of values. Again, alot left out of the question that is needed. — d parolin, Jun 06 '18 at 17:42
So the transactions table is roughly 2m rows, and the users table is 40 rows. I am relatively new to Python. Although the code runs, I'm confused by what is meant to happen from the internal part of the code. — walecrackers, Jun 06 '18 at 18:18
OK, just be aware that people usually expect that you do an attempt, reading the documentation etc. I will do an effort explaining it in an answer, there are problems in the source as you have posted it, conn = psycopg2.connect(conn_string) cur = conn.cursor() need to be separate lines as well as cur.execute(query) data = cur.fetchall() — d parolin, Jun 06 '18 at 18:46
While I am commenting on your code, I figured that **internal** is really big at the time you print, which is probably not really what you want from that code, how long does your code take to start printing results ? — d parolin, Jun 06 '18 at 18:55
Thanks for assisting with this. It's currently taking around 15secs for the code to start printing results. — walecrackers, Jun 06 '18 at 19:12
Yeah that is what I thought, so the results are ready in 15 seconds, so the iteration is not really the problem, the printing of ALL results is. What is your intention with the results ? As I indicated in my answer, try printing one first instead of all by adding an index like print(internal[0]) which yields the first result. 15 seconds do not sound too bad for 2m records. — d parolin, Jun 06 '18 at 19:17

d parolin · Accepted Answer · 2018-06-06T19:27:29.767

This shall guide you through the code, step by step, the comments apply always to the line after them to explain

# Imports and connection string setup
# ...

# You create a connection object and a cursor here, which are prerequisites to 
# executing queries.
conn = psycopg2.connect(conn_string)
cur = conn.cursor()

# This is your query, however this is static and you would have to manipulate 
# this query every time you want something different.
# If you want to dynamically change stuff look at the links 
# posted below the code.

query = """
        SELECT
            users.id,
            users.company_id,
            users.state,
            users.phase,
            transactions.amount
        FROM
            cases
            INNER JOIN transactions ON user.id = transactions.user_id
        WHERE users.company_id IN (9, 15)
        AND transactions.date < '2018-01-01'
"""

# The following line is unnecessary, although it may help understand what
# these columns are named
# To get all column names from the DB you can use
#   colnames = [desc[0] for desc in curs.description]
# After you executed your query below
fields = ['id', 'company_id', 'state', 'phase', 'amount']

cur.execute(query)

# Here you fetch all data, which can be a lot (you may want to check
# memory consumption)
data = cur.fetchall()

# Setting up a result list and iterate over the results of the query
internal = []
for d in data:
    # The next line is not needed, we do not need to append an empty list
    # to the results as we can just append the final result for this row
    # further += is overloading the operator, it is cleaner (IMHO) to
    # use .append() here
    internal += [[''] * 4]
    # Now the remainder in the loop is interesting, we add an empty list
    # of values to the end of the results (internal) and then use the last
    # index to update the values
    # Apart from printing the full results in the last step this will probably
    # incur some overhead compared to creating the single row's result first
    # and then appending/printing it

    # This simply assigns the value of the current rows index 1 to the last
    # item in internal (the empty list added before)
    internal[-1][0] = d[1]

    # Here we obtain the index of 'phase' from fields to know which column
    # the value is

    # If you never change the fields variable above, your can just simply index
    # it directly.

    # After knowing the index (fields.index('phase')) which would be 3 we do a
    # comparison if it is 'activated' as a value
    # So the result will be True or False for everything that is
    # not 'activated'
    internal[-1][1] = d[fields.index('phase')] == 'activated'
    # Here we simply find the column idx for state (which is 2) and add the
    # value.
    internal[-1][2] = d[fields.index('state')]
    # Here we check if the amount is > 0 and assign 'success' if so, otherwise
    # 'fail'.
    internal[-1][3] = 'success' if d[4] > 0 else 'fail'

# You are printing ALL results here, for simple check you may want to print
# only the first or last result as such:

# print(internal[0])
# or
# print(internal[-1])

print(internal)

[EDIT] Adding refactored result "parsing" here (this is just one of the options)

conn = psycopg2.connect(conn_string)
cur = conn.cursor()

query = """
        SELECT
            users.id,
            users.company_id,
            users.state,
            users.phase,
            transactions.amount
        FROM
            cases
            INNER JOIN transactions ON user.id = transactions.user_id
        WHERE users.company_id IN (9, 15)
        AND transactions.date < '2018-01-01'
"""

fields = ['id', 'company_id', 'state', 'phase', 'amount']

cur.execute(query)
data = cur.fetchall()

internal = []
for d in data:
    # We simply create a list with all the same results and append it
    row = [
        d[1],
        d[fields.index('phase')] == 'activated',
        d[fields.index('state')],
        'success' if d[4] > 0 else 'fail'
    ]
    internal.append(row)

# print(internal)
print(internal[0])

There are further simplifications that can be done to the loop through data (in the sense to make it easier to understand), but those will not severely improve performance.

Postgres Python Query Parameters

@walecackers Please let others know if e.g. the print statement was the bottleneck for you. — d parolin, Jun 06 '18 at 19:59

Postgres in Python

1 Answers1