
I am working on speeding up an upsert method which takes a pandas DataFrame and executes a query to insert all the data from the DataFrame into a Postgres DB. The slowest part of my method is a for loop which calls a cast function to transform each Python value to its respective PostgreSQL type using a map, and builds a string of all these values (called vals) to be inserted into my query later on.

cdef str vals = ''
# Build one "(v1, v2, ...)" string per row, one column at a time
df['row_statement'] = '('
cdef str c
for c in cols:
    df.row_statement += df[c].apply(lambda x: self.cast(c, x)).astype(str) + ', '

df.row_statement = df.row_statement.str[:-2] + ')'  # drop the trailing ", " and close each row tuple
df.row_statement += ','
vals = df.row_statement.str.cat()
vals = vals[1:-2]  # strip the leading "(" and trailing ")," because the query template adds its own parentheses
df = df.drop('row_statement', axis=1)

def cast(self, column_name, value, emptyList=False):
    cdef n = sqlalchemy.sql.null()
    # Empty scalars (NaN / empty string) and, optionally, empty lists become SQL NULL
    if not isinstance(value, list):
        if pd.isnull(value) or value == '':
            return n
    elif emptyList and not value:
        return n
    cdef str typ = self._data_types[column_name]
    return self._cast_map[typ](value)

My `_cast_map` is basically a dictionary of lambda functions like this:

'text': lambda x: f"'{x}'",
'date': lambda x: f"'{x}'::date",
'timestamp without time zone': lambda x: f"'{x}'::timestamp",
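
To make that concrete, this is roughly what cast ends up producing (the column names and types below are made up purely for illustration):

self._data_types = {'name': 'text', 'created': 'date'}   # hypothetical columns

self.cast('name', 'Alice')          # -> "'Alice'"
self.cast('created', '2022-12-15')  # -> "'2022-12-15'::date"
self.cast('name', '')               # -> sqlalchemy.sql.null(), which renders as NULL once astype(str) is applied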

This is my query string:

 qry = f"""
            INSERT INTO {self.schema}.{self.table} ({','.join(df.columns.tolist())})
            VALUES ({vals})
            {self._conflict if self._conflict else ''}
    """

While researching faster alternatives to apply, I keep hearing about vectorization. Is it possible for me to vectorize in this case, given that my cast function requires the column name as well as the value? What would be the best way to go about speeding up a function like this? Is there a better way to cast each value than a dictionary of lambda functions?
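
For example, I imagine a per-column vectorized version might look roughly like the sketch below (the suffix map is something I made up here, and I haven't verified it handles NULLs, quoting or escaping correctly):

# Sketch: look the type up once per column and use vectorized string ops
# instead of calling cast() once per value.
suffix = {'text': '', 'date': '::date', 'timestamp without time zone': '::timestamp'}

parts = []
for c in cols:
    s = "'" + df[c].astype(str) + "'" + suffix[self._data_types[c]]   # whole column at once
    s = s.where(df[c].notna() & (df[c] != ''), 'NULL')                # empties become NULL
    parts.append(s)

# one "(...)" string per row, then all rows joined; note this keeps the outer
# parentheses, so the query template would use VALUES {vals} instead of VALUES ({vals})
vals = ('(' + parts[0].str.cat(others=parts[1:], sep=', ') + ')').str.cat(sep=',')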

I've tried iterating through the dataframe as numpy arrays but this seems to be slower even after cythonizing as much as I can. I also tried Numba as well as pandarallel and swifter, with no luck :(

cdef int r
cdef int c
cdef int c_max = df_arr.shape[1]
cdef int r_max = df_arr.shape[0]

for r in range(r_max):
    vals += '('
    for c in range(c_max):
        vals += str(self.cast(index[c], df_arr[r, c])) + ','
    vals = vals[:-1]
    vals += '),'
vals = vals[1:-2]
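
For reference, the same loop can also be written so that vals isn't grown by repeated string concatenation (just a sketch; it still calls cast() once per value, which is the slow part):

# Sketch: accumulate row strings in a list and join once at the end
rows = []
for r in range(r_max):
    row = ', '.join(str(self.cast(index[c], df_arr[r, c])) for c in range(c_max))
    rows.append('(' + row + ')')
vals = ','.join(rows)
vals = vals[1:-1]   # same trimming trick so the template's VALUES ({vals}) parentheses line up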
  • Lambda functions are always comparatively slow. But, why are you reinventing the wheel? Why not use sqlalchemy's native functionality? Like [df.to_sql()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html)? – MatBailie Dec 14 '22 at 23:40
  • @MatBailie it is far slower and tends to fail on very large inserts. The above method inserts 10,000 rows of data in around 5.5 seconds whereas df.to_sql takes around 20 seconds to complete :( – DZ14 Dec 15 '22 at 00:23
  • Even with `method='multi'`? Precisely what is your `to_sql` code? – MatBailie Dec 15 '22 at 00:26
  • @MatBailie yup.. df.to_sql(self.table, self.engine, self.schema, if_exists='append',method='multi', index=False, chunksize=1000) – DZ14 Dec 15 '22 at 00:29
  • Abuse conversion to csv? https://stackoverflow.com/questions/31997859/bulk-insert-a-pandas-dataframe-using-sqlalchemy – MatBailie Dec 15 '22 at 00:44
  • This may have useful options, but I'm not convinced... https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-enabled-insert-update-and-delete-statements – MatBailie Dec 15 '22 at 00:50
  • https://stackoverflow.com/questions/29706278/python-pandas-to-sql-with-sqlalchemy-how-to-speed-up-exporting-to-ms-sql/53091923#53091923 – MatBailie Dec 15 '22 at 00:53
  • https://stackoverflow.com/questions/38204064/turn-pandas-dataframe-into-a-file-like-object-in-memory – MatBailie Dec 15 '22 at 00:55
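
For reference, the CSV route MatBailie links to above would look roughly like this (a sketch only, assuming psycopg2 as the driver; it sidesteps the per-value casting entirely, but I haven't checked how it would fit with my ON CONFLICT handling):

import io

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)   # let pandas serialise the values
buf.seek(0)

raw = self.engine.raw_connection()          # self.engine is the existing sqlalchemy engine
try:
    cur = raw.cursor()
    cur.copy_expert(
        f"COPY {self.schema}.{self.table} ({','.join(df.columns)}) FROM STDIN WITH (FORMAT CSV)",
        buf,
    )
    raw.commit()
finally:
    raw.close()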
