
I am experiencing slow performance when using Pandas to load data from an Excel file into an existing Redshift table. The Excel file has 10+ columns and 20000+ rows, and the operation is taking over 7 hours to complete. Is there a way to optimize the code and increase performance? Note that I do not have access to S3 and cannot use it as an option.

import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of the Excel file into a dataframe
excel_file = 'your-excel-file.xlsx'  # placeholder path
df = pd.read_excel(excel_file)

# Insert dataframe records into the table
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"

for index, row in df.iterrows():
    cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))

conn.commit()
cur.close()
  • Did you try `to_sql`? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html – MP24 Jun 20 '23 at 22:18
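For reference, the `to_sql` approach from the comment might look like the sketch below. It assumes a SQLAlchemy engine (the connection string, table name, and chunk size are placeholders to adapt), and note that `to_sql` does not truncate, so the TRUNCATE step would still be needed first:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - substitute your own credentials
engine = create_engine('postgresql+psycopg2://your-username:your-password@your-redshift-host:your-redshift-port/your-database')

df = pd.read_excel(excel_file)

# method='multi' packs many rows into each INSERT statement instead of
# issuing one statement per row; chunksize controls the batch size
df.to_sql('table_name', engine, if_exists='append', index=False, method='multi', chunksize=1000)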

1 Answer


Welcome to the community! You are iterating through the dataframe with df.iterrows(); looping over a dataframe row by row in Python is slow and generally not recommended (source).

I would recommend using the apply function instead; you can find more about apply here.

In your case, the following modifications could provide better performance:

import pandas as pd
import psycopg2

# Establish a connection to Redshift
conn = psycopg2.connect(
    host='your-redshift-host',
    port='your-redshift-port',
    user='your-username',
    password='your-password',
    dbname='your-database'
)

# Read the contents of the Excel file into a dataframe
excel_file = 'your-excel-file.xlsx'  # placeholder path
df = pd.read_excel(excel_file)


# Insert dataframe records into the table
cur = conn.cursor()
cur.execute('TRUNCATE TABLE table_name')
insert_query = "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES (%s, %s, %s)"

# apply moves the row iteration from a Python-level loop into Pandas
df.apply(lambda row: cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3'])), axis=1)

# Original row-by-row loop, kept for comparison:
# for index, row in df.iterrows():
#     cur.execute(insert_query, (row['dfcolumn1'], row['dfcolumn2'], row['dfcolumn3']))

conn.commit()
cur.close()
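Note that apply still issues one INSERT per row, so the per-statement round trip to Redshift remains the dominant cost. If the load is still slow, batching rows into multi-row INSERT statements usually helps far more; here is a minimal sketch using psycopg2's execute_values helper (column and table names taken from your example):

from psycopg2.extras import execute_values

# Collect the rows to insert as a list of plain tuples
rows = list(df[['dfcolumn1', 'dfcolumn2', 'dfcolumn3']].itertuples(index=False, name=None))

# execute_values expands the single %s placeholder into batches of row
# tuples, sending page_size rows per INSERT statement
execute_values(
    cur,
    "INSERT INTO table_name (tablecolumn1, tablecolumn2, tablecolumn3) VALUES %s",
    rows,
    page_size=1000
)
conn.commit()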
– rr_goyal