
I am trying to learn how to save a dataframe created in pandas into a PostgreSQL database (hosted on Azure). I planned to start with simple dummy data:

import pandas as pd

data = {'a': ['x', 'y'],
        'b': ['z', 'p'],
        'c': [3, 5]}

df = pd.DataFrame(data, columns=['a', 'b', 'c'])

I found a function that pushes the dataframe into a PostgreSQL table. It starts with defining the connection:

import sys

import psycopg2

def connect(params_dic):
    """Connect to the PostgreSQL database server."""
    conn = None
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = psycopg2.connect(**params_dic)
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        sys.exit(1)
    print("Connection successful")
    return conn

conn = connect(param_dic)

param_dic contains all the connection details (user/password/host/database).
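For reference, a minimal sketch of what param_dic might look like; all values are placeholders, and the user@server login format and sslmode entry are assumptions about Azure Database for PostgreSQL:

param_dic = {
    "host": "myserver.postgres.database.azure.com",  # placeholder Azure host
    "dbname": "mydb",                                # placeholder database name
    "user": "myuser@myserver",   # Azure single-server style login (assumption)
    "password": "mypassword",    # placeholder
    "sslmode": "require",        # Azure typically enforces SSL (assumption)
}

Once the connection is established, I define the execute function: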

def execute_many(conn, df, table):
    """
    Using cursor.executemany() to insert the dataframe
    """
    # Create a list of tuples from the dataframe values
    tuples = [tuple(x) for x in df.to_numpy()]
    # Comma-separated dataframe columns
    cols = ','.join(list(df.columns))
    # SQL query to execute
    query  = "INSERT INTO %s(%s) VALUES(%%s,%%s,%%s)" % (table, cols)
    cursor = conn.cursor()
    try:
        cursor.executemany(query, tuples)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("execute_many() done")
    cursor.close()

I executed this function against a table that I created in the database:

execute_many(conn,df,"raw_data.test")

The table raw_data.test consists of columns a (char[]), b (char[]), and c (numeric). When I run the code, I get the following output in the console:

Connecting to the PostgreSQL database...
Connection successful
Error: malformed array literal: "x"
LINE 1: INSERT INTO raw_data.test(a,b,c) VALUES('x','z',3)
                                                ^
DETAIL:  Array value must start with "{" or dimension information.

I don't know how to interpret this, because none of the columns in df are arrays:

df.dtypes
Out[185]: 
a    object
b    object
c     int64
dtype: object

Any ideas what is going wrong here, or suggestions for a simpler way to save a dataframe to PostgreSQL? I found quite a lot of solutions that use SQLAlchemy, creating a connection string in the following way:

conn_string = 'postgres://user:password@host/database'

But I am not sure if that works with a cloud database; when I edit such a connection string with the Azure host details, it does not work.
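For reference, a hedged sketch of how that SQLAlchemy route might look with Azure details filled in. The host, user, password, and database names are placeholders; the user@server login format and sslmode=require are assumptions about Azure Database for PostgreSQL. Note also that recent SQLAlchemy versions require the postgresql:// scheme rather than postgres://:

from urllib.parse import quote_plus

from sqlalchemy import create_engine

user = quote_plus("myuser@myserver")  # Azure logins contain '@'; URL-encode it
conn_string = (
    f"postgresql+psycopg2://{user}:mypassword"
    "@myserver.postgres.database.azure.com/mydb?sslmode=require"
)
engine = create_engine(conn_string)

# pandas writes the dataframe through the engine; appends to raw_data.test
df.to_sql("test", engine, schema="raw_data", if_exists="append", index=False)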

Agatella
  • PostgreSQL is telling you that your table "test" has an array column, and you aren't passing it an array. It's column "a" by the look of it, but if you posted the table definition someone could verify this. – Richard Huxton Jul 04 '21 at 11:10
  • Thanks Richard, the issue in this case was related to datatypes set on PSQL side. – Agatella Jul 05 '21 at 07:07

2 Answers


The usual data type for strings in PostgreSQL is TEXT or VARCHAR(n) or CHAR(n), with round brackets; not CHAR[] with square brackets.

  • I'm guessing that you want the column to contain a string and that CHAR[] was a typo; in that case, you'll need to recreate (or migrate) the table column to the correct type - most likely TEXT.

    (You might use CHAR(n) for fixed-length data, if it's genuinely fixed-length; VARCHAR(n) is mostly of historical interest. In most cases, use TEXT.)

  • Alternately, if you do mean to make the column an array, you'll need to pass a list in that position from Python. (A sketch of both options follows.)
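A minimal sketch of both options, reusing the conn object and the raw_data.test table from the question:

# Option 1: recreate the table with TEXT columns
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS raw_data.test")
cur.execute("CREATE TABLE raw_data.test (a text, b text, c numeric)")
conn.commit()
cur.close()

# Option 2: keep the char[] columns and pass Python lists,
# which psycopg2 adapts to PostgreSQL arrays
tuples = [(['x'], ['z'], 3), (['y'], ['p'], 5)]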

Jiří Baum
  • Thank you @sabik for this advice. I dropped the table and re-created it with `TEXT` as a datatype and it resolved the issue. – Agatella Jul 04 '21 at 11:53

Consider adjusting your parameterization, as psycopg2 supports a better way to format identifiers (such as table or column names) in SQL statements.

In fact, the docs indicate your current approach is not optimal and poses a security risk:

# This works, but it is not optimal
query = "INSERT INTO %s(%s) VALUES(%%s,%%s,%%s)" % (table, cols) 

Instead, use the psycopg2.sql module:

from psycopg2 import sql 
...

query = (
    sql.SQL("insert into {} values (%s, %s, %s)") 
    .format(sql.Identifier('table'))
)
...
cur.executemany(query, tuples)

Also, as a best practice in SQL, always include column names in append (INSERT) queries and do not rely on the column order of the stored table:

query = (
    sql.SQL("insert into {0} ({1}, {2}, {3}) values (%s, %s, %s)") 
    .format(
        sql.Identifier('table'), 
        sql.Identifier('col1'),
        sql.Identifier('col2'), 
        sql.Identifier('col3')
    )
)
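If the real table has many columns (13, per the comments below), the identifiers and placeholders can be generated from the dataframe itself. A sketch, assuming the df, conn, and raw_data.test table from the question (the multi-part Identifier needs psycopg2 2.8+):

from psycopg2 import sql

def build_insert(schema, table, columns):
    # Renders: INSERT INTO "schema"."table" ("c1", ..., "cn") VALUES (%s, ..., %s)
    return sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier(schema, table),                        # schema-qualified name
        sql.SQL(", ").join(map(sql.Identifier, columns)),     # quoted column names
        sql.SQL(", ").join(sql.Placeholder() * len(columns))  # one %s per column
    )

query = build_insert("raw_data", "test", list(df.columns))
cursor = conn.cursor()
cursor.executemany(query, [tuple(x) for x in df.to_numpy()])
conn.commit()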

Finally, consider discontinuing % for string formatting across all your Python code (not just psycopg2). In Python 3 this method has been de-emphasized, though not deprecated. Instead, use str.format (Python 2.6+) or f-strings (Python 3.6+).
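For instance, the error message in the question's execute_many could be written as:

print(f"Error: {error}")  # instead of print("Error: %s" % error)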

Parfait
  • Thank you for pointing me to the relevant documentation. When I tried to re-apply the original approach to the actual data set, which has 13 columns, I again ran into errors, and I think I need to redo the query in the format you suggested. – Agatella Jul 05 '21 at 07:08
  • Yes, give it a try: explicitly define the columns in the append query and line up the corresponding values with the column data types. – Parfait Jul 05 '21 at 14:11