
I'm running a simple file format conversion on a very large file, turning a 1 GB file into a 15 GB file. Data is read into a DataFrame, some processing is done, and the result is then written out in a new format. When this ran in Python 2.7 it took about 45 minutes. Now in Python 3.6 it takes 10 hours, and I don't know why there is such a big difference. Here's a small section of the output code. In total it writes out 663 lines per loop and goes through 720,000 loops. I would appreciate any suggestions on improving speed, and on what makes Python 3.6 so slow compared to 2.7.

ASE = open (os.path.join(OutputDir,OutputFileName) ,'w',newline = '\n', buffering=131072)
ASE.write('ScenSet,SetName,ScenName,Scenario Shift Rule,ScenType,ScenValue\n')
for i,row in dfScen.iterrows():
    if i[1] != 0:
        if i == (1,1):
            ASE.write(',{0},{0}_{1},1,\n'.format(Setname,i[0]))
        elif i[1] == 1:
            ASE.write(',,{0}_{1},1,\n'.format(Setname,i[0]))
        # write assets
        ASE.write(',,,,,SPTR Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['SPTR'] * SPTRLevel))
        ASE.write(',,,,,NDX Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['NDX'] * NDXLevel))
        ASE.write(',,,,,SX5E Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['SX5E'] * SX5ELevel))
        ASE.write(',,,,,BOND Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['BOND'] * BONDLevel))
        ASE.write(',,,,,CASH Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['CASH'] * CASHLevel))
        ASE.write(',,,,,SPX Index,{0},,@Sliding Axis 2,@Sliding Axis 2,@Standard(),Trigger Time,absolute,Term/Time,0\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(row['SPTR'] * SPXLevel))
        ASE.write(',,,,,VolCurveShock SPTR,{0},,@Constant,@Constant,@Standard(),Trigger Time,non-parallel shift,Moneyness/Option Term,1826,3653\n'.format(i[1]))
        ASE.write(',,,,,,,,,,,,,1,{0:.7f},{1:.7f}\n'.format(row['ImpliedVolatility,5'],row['ImpliedVolatility,10']))
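
One direction that might be worth trying, sketched here under assumptions rather than measured on the real data: scale the asset columns once with a vectorized multiply, iterate with `itertuples(name=None)` instead of `iterrows()`, and write each row's lines with a single `write()` call. The helper name `write_scenarios`, the `levels` dict, and the tiny sample frame are hypothetical. The sketch also leaves out the set-name lines, the SPX line that reuses `SPTR`, and the volatility line.

```python
import io

import pandas as pd

HEADER = (',,,,,{0} Index,{1},,@Sliding Axis 2,@Sliding Axis 2,'
          '@Standard(),Trigger Time,absolute,Term/Time,0\n')
VALUE = ',,,,,,,,,,,,,0,{0:.2f}\n'


def write_scenarios(df, levels, out):
    """Write a header line and a value line per asset for each row.

    `levels` maps an asset column name to its multiplier (hypothetical).
    """
    assets = list(levels)
    # Multiply every asset column by its level once, not once per row.
    scaled = df[assets].mul(pd.Series(levels))
    # name=None yields plain tuples: (index, value_1, value_2, ...).
    for row in scaled.itertuples(name=None):
        (scen, step), values = row[0], row[1:]
        if step == 0:
            continue
        lines = []
        for asset, value in zip(assets, values):
            lines.append(HEADER.format(asset, step))
            lines.append(VALUE.format(value))
        # One write per row instead of one write per line.
        out.write(''.join(lines))


# Tiny made-up frame with a (scenario, step) MultiIndex.
index = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (1, 2)])
sample = pd.DataFrame({'SPTR': [1.0, 1.5, 2.0], 'NDX': [1.0, 0.5, 0.25]},
                      index=index)
buf = io.StringIO()
write_scenarios(sample, {'SPTR': 2000.0, 'NDX': 5000.0}, buf)
```

Whether this helps on the real 720,000-row frame would need to be timed. It only avoids building a Series for every row and reduces the number of `write()` calls.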
J.Sung
  • You can (and should) profile this. One possibility is that this has to do with encoding. If what you are writing is CSV, it would likely also be faster to let pandas do the writing. – pvg Jun 12 '17 at 15:07
  • Sorry, I'm not familiar with profiling. Is that for timing each process? I know the time is taken at the formatting/lookups such as row['CASH'], but I haven't found a faster way to do that. I tried looping through arrays but it's just as slow. I've tried writing to memory first before writing to file, but it made no difference. If I put in a fixed value then it's back to 45 minutes, but that defeats the purpose. – J.Sung Jun 12 '17 at 15:48
  • Take a look at https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script or just google python profilers. Do this for a subset of your code, similar to the example you've shown. – pvg Jun 12 '17 at 15:52
  • With about 500 million calls to format() and write(), the time is almost certainly being spent in your calls to format and write. Maybe you should stick with Python 2.7? – DisappointedByUnaccountableMod Jun 12 '17 at 16:39
  • I would go back, but I've added some Pandas functions that are only available in 3+ and that actually save time through vectorization. It would be a long process if I had to go back and rebuild that code. Plus I'd be stuck on the future enhancements I have planned with Pandas and Bokeh. – J.Sung Jun 13 '17 at 01:28
  • But what happened in Python 3+ that's making this format() so much slower? – J.Sung Jun 13 '17 at 01:34
  • Using a well-known search engine might find a lot more info than sitting here hoping for someone to give you an answer. Maybe _python 3 slower format_ would work? – DisappointedByUnaccountableMod Jun 13 '17 at 09:06
  • Thanks for the suggestions. Just to close the loop on this: I did follow the trail and eventually upgraded to Python 3.6.1, which brought the format and write time down to 2 1/2 hours, faster than f-strings. Unfortunately 3.6.1 made some other dataframe functions slower, but it's a good trade-off for now. Thanks everyone. – J.Sung Jun 14 '17 at 13:45
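
For readers unfamiliar with the profiling suggested in the comments, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. The function `format_lines` is a hypothetical stand-in for a small subset of the write loop. It is not the original code, and the report contents will differ from machine to machine.

```python
import cProfile
import io
import pstats


def format_lines(n):
    """Hypothetical stand-in for the write loop: format n value lines."""
    out = io.StringIO()
    for i in range(n):
        out.write(',,,,,,,,,,,,,0,{0:.2f}\n'.format(i * 1.5))
    return out.getvalue()


# Collect timing data only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
result = format_lines(10000)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats('cumulative').print_stats(10)
print(report.getvalue())
```

The report lists call counts and time per function, which shows whether `format()`, `write()`, or the row lookups dominate. A whole script can also be profiled from the command line with `python -m cProfile -s cumulative script.py`.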

0 Answers