I have a ~250,000 row dataset that I am pulling from an Oracle DB. The main data I am concerned with is a text field that comes back as a HUGECLOB object. Storing my data to a CSV took quite a bit of time, so I decided to switch to Feather via: Pandas read_csv speed up. There was another question that used Feather too, but I cannot find it. I also tried to_hdf, but for some reason that did not work. Either way, to get my query results to work with Feather, the text column must be converted to string. So I have:
import pandas as pd

SQLquery = ('SELECT * ...')  # full query omitted here
datai = pd.read_sql(SQLquery, conn)  # conn: my existing Oracle connection
print("Query Passed, start date")
datai['REPORTDATE'] = pd.to_datetime(datai['REPORTDATE'], format='%m-%d-%Y')
print("Row done, string")
# This is the step that hangs: per-row object-to-str conversion
datai['LOWER(LD.LDTEXT)'] = datai['LOWER(LD.LDTEXT)'].apply(str)
print("Data Retrieved")
print("To feather start")
datai.to_feather(r'C:\Users\asdean\Documents\Nuclear Text\dataStore\rawData2.feather')
print("Done with feather")
Note: I put a bunch of print statements because I was trying to figure out where it got hung up.
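For what it is worth, a lighter-weight way to see where it hangs (a sketch, nothing Oracle-specific assumed) is to time each stage rather than eyeball the prints:

import time

# Sketch: time the suspect stage instead of bracketing it with prints
t0 = time.perf_counter()
datai['LOWER(LD.LDTEXT)'] = datai['LOWER(LD.LDTEXT)'].apply(str)
print(f"string conversion took {time.perf_counter() - t0:.1f}s")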
The text column is identified as datai['LOWER(LD.LDTEXT)']. Some of the rows contain quite a bit of text (a couple of paragraphs each). The string conversion TAKES FOREVER (or may not even be completing). I did not have this problem when I read the data from an old CSV (old data that we no longer use; we have since updated the query).
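One thing worth checking first (a minimal diagnostic sketch; I am assuming cx_Oracle is the driver behind conn): whether read_sql handed back plain strings or LOB locator objects. The cx_Oracle docs note that reading from a locator costs a round-trip to the database, which would explain per-row conversion being so slow:

# Sketch: does the column hold str values or cx_Oracle LOB locators?
cell = datai['LOWER(LD.LDTEXT)'].iloc[0]
print(type(cell))            # str, or something like <class 'cx_Oracle.LOB'>?
if hasattr(cell, 'read'):    # LOB locators expose .read()
    print(cell.read()[:80])  # each .read() is a separate database round-trip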
I have tried all the common ways of doing this: astype(str), map(str), apply(str), and values.astype(str), with no success in speeding it up to a reasonable pace (under an hour). Is there a way to do faster object-to-string conversions? Is there a library I am missing here? Is there a faster way through Oracle/HUGECLOB? How can I speed this up?
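In case it helps frame the Oracle/HUGECLOB part of the question: below is a sketch of the driver-level approach I am wondering about, based on the cx_Oracle documentation on fetching LOBs as strings. It assumes conn is a cx_Oracle connection and is untested on my data; the idea is that the CLOB arrives as a plain str, so no pandas-side conversion is needed before to_feather.

import cx_Oracle
import pandas as pd

# Sketch (untested on my data), per the cx_Oracle docs on fetching LOBs
# as strings: have the driver return CLOB columns as plain str values.
def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type == cx_Oracle.CLOB:
        # LONG_STRING fetches the whole CLOB value as a Python str
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

conn.outputtypehandler = output_type_handler  # conn: existing connection
datai = pd.read_sql(SQLquery, conn)           # text column should now be str

If I am reading the docs right, this only works for LOB values under 1 GB, which should be fine for a few paragraphs of text per row.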