I am working on speeding up an upsert method that takes a Pandas DataFrame and executes a query to insert all of its data into a Postgres DB. The slowest part of my method is a for loop that calls a cast function to transform each Python value into its PostgreSQL representation (using a map of types to lambdas) and builds a string of all these values (called vals) to be interpolated into my query later on.
cdef str vals = ''
cdef str c

df['row_statement'] = '('
for c in cols:
    df['row_statement'] += df[c].apply(lambda x: self.cast(c, x)).astype(str) + ', '
df['row_statement'] = df['row_statement'].str[:-2] + ')'
df['row_statement'] += ','
vals = df['row_statement'].str.cat()
vals = vals[1:-2]  # drop the leading '(' and trailing '),' -- the query adds them back
df = df.drop('row_statement', axis=1)
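For reference, here is a stripped-down, self-contained version of what the loop above produces on a toy frame (the column names and the stand-ins for self._data_types and self._cast_map are made up for illustration):

```python
import pandas as pd

# hypothetical stand-ins for self._data_types and self._cast_map
data_types = {'name': 'text', 'joined': 'date'}
cast_map = {
    'text': lambda x: f"'{x}'",
    'date': lambda x: f"'{x}'::date",
}

df = pd.DataFrame({'name': ['alice', 'bob'],
                   'joined': ['2020-01-01', '2020-02-02']})
cols = list(df.columns)

df['row_statement'] = '('
for c in cols:
    # the cast only depends on the column, so the lambda is resolved once per column
    fn = cast_map[data_types[c]]
    df['row_statement'] += df[c].apply(fn) + ', '
df['row_statement'] = df['row_statement'].str[:-2] + '),'
vals = df['row_statement'].str.cat()[1:-2]
print(vals)
# 'alice', '2020-01-01'::date),('bob', '2020-02-02'::date
```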
def cast(self, column_name, value, emptyList=False):
    cdef object n = sqlalchemy.sql.null()
    if not isinstance(value, list):
        if pd.isnull(value) or value == '':
            return n
    elif emptyList and not value:
        return n
    cdef str typ = self._data_types[column_name]
    return self._cast_map[typ](value)
My _cast_map is basically a dictionary of lambda functions, like this:
'text': lambda x: f"'{x}'",
'date': lambda x: f"'{x}'::date",
'timestamp without time zone': lambda x: f"'{x}'::timestamp",
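As a sanity check, each lambda just turns a Python value into a SQL literal snippet, e.g.:

```python
# the same cast map as above, pulled out as a plain dict for illustration
cast_map = {
    'text': lambda x: f"'{x}'",
    'date': lambda x: f"'{x}'::date",
    'timestamp without time zone': lambda x: f"'{x}'::timestamp",
}

print(cast_map['text']('hello'))       # 'hello'
print(cast_map['date']('2020-01-01'))  # '2020-01-01'::date
```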
This is my query string:
qry = f"""
INSERT INTO {self.schema}.{self.table} ({','.join(df.columns.tolist())})
VALUES ({vals})
{self._conflict if self._conflict else ''}
"""
When researching faster alternatives to apply, I keep hearing about vectorization. Is vectorization even possible in my case, given that my cast function requires the column name as well as the value? What would be the best way to go about speeding up a function like this? Is there a better way to cast each value than a dictionary of lambda functions?
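To make the vectorization question concrete: since the type (and therefore the cast) is fixed per column, I imagine the per-cell lambda could in principle be replaced with whole-Series string operations, something like this hypothetical sketch for a date column:

```python
import pandas as pd

s = pd.Series(['2020-01-01', '2020-02-02'])

# one vectorized string concatenation per column
# instead of a Python-level function call per cell
out = "'" + s.astype(str) + "'::date"
print(out.tolist())
# ["'2020-01-01'::date", "'2020-02-02'::date"]
```

But I'm not sure how to fit null handling into that approach.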
I've tried iterating through the DataFrame as NumPy arrays, but this seems to be slower even after cythonizing as much as I can. I also tried Numba, as well as pandarallel and swifter, with no luck :(
cdef int r
cdef int c
cdef int c_max = df_arr.shape[1]
cdef int r_max = df_arr.shape[0]

for r in range(r_max):
    vals += '('
    for c in range(c_max):
        vals += str(self.cast(index[c], df_arr[r, c])) + ','
    vals = vals[:-1]
    vals += '),'
vals = vals[1:-2]