
I have a dataframe with 4 million rows and 53 columns that I am trying to write to an Oracle table. See below a snippet of my code in Python:

import pandas as pd
import cx_Oracle
conn = (--------------)
df = pd.read_sql(------)

# write to Oracle table
df.to_sql(---)

This code has been running for over three days now with no end in sight. Please, how can I get the progress of the insertion?

PS: My connection is working well, and I already confirmed that to_sql() works, because I tried it on a dataframe with 10 rows and it succeeded.

Edit: Thanks everyone, this link helped. I did an explicit conversion of the string columns and my code executed in 26 minutes!
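(A rough sketch of the kind of explicit string conversion meant here, assuming it was done through to_sql's dtype parameter; the engine URL, table name and column sizing are placeholders, not the actual values.)

import pandas as pd
from sqlalchemy import create_engine, types

engine = create_engine("oracle+cx_oracle://user:pwd@host:1521/?service_name=svc")  # placeholder DSN

# map every object (string) column to a fixed-width VARCHAR so cx_Oracle
# does not fall back to slow CLOB binding for those columns
str_cols = df.select_dtypes(include=["object"]).columns  # df = the 4-million-row dataframe above
dtype_map = {col: types.VARCHAR(int(df[col].astype(str).str.len().max()))
             for col in str_cols}

df.to_sql("table_name", con=engine, if_exists="replace",
          chunksize=10000, index=False, dtype=dtype_map)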

lekz
  • Did you try opening a parallel connection to your Oracle DB and reading, with pandas, the table that you are putting data into? I mean a different script that you execute without interrupting your current process. – Aerials Jan 13 '20 at 08:08
  • Unless the insert has been COMMITted, a parallel connection won't see anything, @Aerials. – Littlefoot Jan 13 '20 at 08:18
  • Can you show us the parameters you are using with the df.to_sql method? – Aerials Jan 13 '20 at 08:51
  • @Aerials df_all.to_sql('table_name', con=e, if_exists='replace', chunksize=10000, index=False) – lekz Jan 13 '20 at 09:14

1 Answer


You can tell whether the insert is already running, or whether the job is still in the pd.read_sql step, by checking for a table lock on your target table. With some bad luck you are still loading the data from the database. Here you should also check whether it is faster to push the work down to the database itself: getting all the data out of a database only to insert it back in can be slow.
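A minimal sketch of such a lock check from a second session, assuming you can query the v$/DBA views; connection details and the table name are placeholders:

import cx_Oracle

conn = cx_Oracle.connect("user", "password", "host:1521/service_name")  # placeholder credentials
cur = conn.cursor()
# if the target table shows up here, the to_sql insert phase has started
cur.execute("""
    SELECT s.sid, s.serial#, s.status, o.object_name
      FROM v$locked_object l
      JOIN dba_objects     o ON o.object_id = l.object_id
      JOIN v$session       s ON s.sid       = l.session_id
     WHERE o.object_name = :tab
""", tab="TABLE_NAME")  # Oracle stores unquoted names in upper case
for row in cur:
    print(row)
conn.close()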

Checking the sessions will not help much because the rows are probably being inserted one by one... I don't know whether the undo/redo or the table's segments grow visibly during the insert, but your Oracle database seems to be quite big, so a DBA can probably help you check that.

Next time you should use some additional parameters when doing such a big operation.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql

-> Read by iterating over the data. Reading the full 4 million rows in one go should be possible, but splitting it into parts will be faster.

Think of using the chunksize parameter (a sketch follows below).
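A minimal sketch of that, assuming SQLAlchemy and placeholder query/table names; reading and writing in chunks also gives you a crude progress indicator for free:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("oracle+cx_oracle://user:pwd@host:1521/?service_name=svc")  # placeholder DSN

total = 0
for chunk in pd.read_sql("SELECT * FROM source_table", con=engine, chunksize=100000):
    # if_exists="append" so each chunk is added to the same target table;
    # truncate or drop the table beforehand if it has to be rebuilt
    chunk.to_sql("target_table", con=engine, if_exists="append",
                 chunksize=10000, index=False)
    total += len(chunk)
    print(f"{total} rows inserted so far")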

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

As written by @lekz, use chunksize, but when doing that many inserts you should also

think of using the method parameter ('multi' or a callable) together with chunksize.

This should increase the speed as well; a sketch follows below.
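A minimal sketch of that combination, with placeholder names. One caveat of my own, not part of the original answer: older Oracle versions reject the multi-row VALUES statement that method='multi' generates, so if it errors out, drop the parameter and rely on chunksize plus an explicit dtype map instead.

df.to_sql("target_table", con=engine, if_exists="replace",
          index=False,
          chunksize=1000,    # keep chunks modest so each generated INSERT stays small
          method="multi")    # builds one multi-row INSERT per chunk instead of row-by-row binds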

Thomas Strub