
I've started using NumPy instead of MATLAB for a lot of things, and for most of them it appears to be much faster. However, I've just tried to replicate some MATLAB code in Python and it is much slower. I was wondering if someone who knows both could have a look at it and see why that is the case.

NumPy:

longTicker = np.empty([1,len(ticker)],dtype='U15')
genericTicker = np.empty([len(ticker)],dtype='U15')
tickerType = np.empty([len(ticker)],dtype='U10')
tickerList = np.vstack((np.empty([2,len(ticker)],dtype='U30'),np.ones([len(ticker)],dtype='U30')))
tickerListnum = 0
modelList = np.empty([2,9999],dtype='U2')
modelListnum = 0
derivativeType = np.ones(len(ticker))

for l in range(0,len(ticker)):
    tickerType[l] = 'Future'

    if not modCode[l] in list(modelList[1,:]):
        modelList[0,modelListnum] = modelListnum + 1
        modelList[1,modelListnum] = modCode[l]
        modelListnum += 1

    if ticker.item(l).find('3 MONTH') >= 0:
        x = list(metalTicks[:,0]).index(ticker[l])
        longTicker[0,l]  = metalTicks[x,3]
        if not longTicker[0,l] in list(tickerList[1,:]):
            tickerList[0,tickerListnum] = tickerListnum + 1
            tickerList[1,tickerListnum] = longTicker[0,l] 
            tickerList[2,tickerListnum] = 4
            tickerListnum += 1

        derivativeType[l] = 4
        tickerType[l] = 'Future'

    if ticker.item(l).find('CURNCY') >= 0:
        if ticker.item(l).find('KRWUSD CURNCY') >= 0:
            prices[l] = 1/float(prices.item(l))

        longTicker[0,l]  = ticker[l,0]
        if not longTicker[0,l] in list(tickerList[1,:]):
            tickerList[0,tickerListnum] = tickerListnum + 1
            tickerList[1,tickerListnum] = longTicker[0,l] 
            tickerList[2,tickerListnum] = 2
            tickerListnum += 1

        derivativeType[l] = 2
        tickerType[l] = 'FX'    

    if ticker.item(l).find('_') >= 0:
        x = ticker[l] == sasTick
        longTicker[0,l]  = bbgTick[x]
        if not longTicker[0,l] in list(tickerList[1,:]):
            tickerList[0,tickerListnum] = tickerListnum + 1
            tickerList[1,tickerListnum] = longTicker[0,l] 
            tickerList[2,tickerListnum] = 3
            tickerListnum += 1

        derivativeType[l] = 3
        tickerType[l] = 'Option'

    # need convert ticker thing    

    if not longTicker[0,l] in list(tickerList[1,:]):
        tickerList[0,tickerListnum] = tickerListnum + 1
        tickerList[1,tickerListnum] = longTicker[0,l]
        tickerList[2,tickerListnum] = 1
        tickerListnum += 1

MATLAB Code:

longTicker = cell(size(ticker));
genericTicker = cell(size(ticker));
type = repmat({'Future'},size(ticker));
tickerList = repmat([cell(1);cell(1);{1}],1,9999);
%tickerList = cell(3,9999);
tickerListnum = 0;
modelList = cell(2,9999);
modelListnum = 0;
derivativeType = ones(size(ticker));

for j=1:length(ticker)

    if isempty(find(strcmp(modCode{j},modelList(2,:)), 1))
        modelListnum = modelListnum+1;
        modelList{1,modelListnum}= modelListnum;
        modelList(2,modelListnum)= modCode(j);
    end

    if ~isempty(strfind(ticker{j},'3 MONTH'))
        x =strcmp(ticker{j},metalTicks(:,1));
        longTicker{j} = metalTicks{x,4};
        % genericTicker{j} = metalTicks{x,4};
        if isempty(find(strcmp(longTicker(j),tickerList(2,:)), 1))
        tickerListnum = tickerListnum+1;
        tickerList{1,tickerListnum}= tickerListnum;
        tickerList(2,tickerListnum)=longTicker(j);
        tickerList{3,tickerListnum}=4;
        end
        derivativeType(j) = 4;
        type{j} = 'Future';
        continue;
    end
    if ~isempty(regexp(ticker{j},'[A-Z]{6}\sCURNCY', 'once'))
        if strcmpi('KRWUSD CURNCY',ticker{j})
            prices{j}=1/prices{j};
        end
        longTicker{j} = ticker{j};
        % genericTicker{j} = ticker{j};
        if isempty(find(strcmp(longTicker(j),tickerList(2,:)), 1))
        tickerListnum = tickerListnum+1;
        tickerList{1,tickerListnum}= tickerListnum;
        tickerList(2,tickerListnum)=longTicker(j);
        tickerList{3,tickerListnum}=2;
        end
        derivativeType(j) = 2;
        type{j} = 'FX';
        continue;
    end
    if ~isempty(regexp(ticker{j},'_', 'once'))
        z = strcmp(ticker{j},sasTick);
        try
            longTicker(j) = bbgTick(z);
        catch
            keyboard;  % I did this - Dave
        end
        % genericTicker(j) = bbgTick(z);
        if isempty(find(strcmp(longTicker(j),tickerList(2,:)), 1))
        tickerListnum = tickerListnum+1;
        tickerList{1,tickerListnum}= tickerListnum;
        tickerList(2,tickerListnum)=longTicker(j);
        tickerList{3,tickerListnum}=3;
        end
        derivativeType(j) = 3;
        type{j} = 'Option';
        continue;
    end
    try
        longTicker{j} = ConvertTicker(ticker{j},'short','long',tradeDate(j));
        % genericTicker{j} = ConvertTicker(ticker{j},'short','generic',tradeDate(j));
    catch
        longTicker{j} = ticker{j};
        % genericTicker{j} = ticker{j};
    end
    if isempty(find(strcmp(longTicker(j),tickerList(2,:)), 1))
        tickerListnum = tickerListnum+1;
        tickerList{1,tickerListnum}= tickerListnum;
        tickerList(2,tickerListnum)=longTicker(j);
        tickerList{3,tickerListnum}=1;
    end
end

MATLAB appears to be faster by a factor of around 100 in this case. Are loops much slower in Python or something?

Lererferler
  • It is hard to say what is causing this. Try to profile your Python script with `python -m cProfile script.py`, and the MATLAB script as well (I know there is a profiler in MATLAB, but I do not know how to use it). – Konstantin Jun 10 '15 at 08:27
  • Could you please explain which parts of the code run slower? Just insert timings at crucial segments. The most important comparison is the duration of one loop iteration in both codes. And maybe give an overall explanation of what the code is supposed to do. Concerning for loops: they tend to be slow in Python, but you can greatly speed them up by using Cython together with NumPy's C compatibility. – Dschoni Jun 10 '15 at 08:37
  • This code was an excerpt from a much larger piece of code. This is the part which appears to be running very slowly, the larger for loop in general. Each iteration is running about 100x faster in MATLAB than in NumPy. The duration of one loop takes 0.05 seconds in MATLAB and roughly 4 seconds in Python – Lererferler Jun 10 '15 at 08:40
  • @Lererferler reduce the scope: add timings at the beginning of an excerpt, in the middle, and at the end. Once you've found out which half consumes most of the time, narrow it down by adding timings there. Since this is an excerpt from a larger script, none of us will be able to run and profile it for you. – Konstantin Jun 10 '15 at 08:53
  • I see, unfortunately I have a lot of work to do this morning so I will have to do it later but I will post back timings when I do. In my ignorance I was hoping there would be an obvious thing slowing the Python script down that somebody could just point out, ie indexing would speed it up considerably or something like that. I will profile it later and report back. – Lererferler Jun 10 '15 at 08:56
  • Matlab improved the execution time of loops one or two years ago. I think they are using some kind of auto-jit. You could try PyPy. Look at this post: http://stackoverflow.com/questions/30475410/is-matlab-faster-than-python-little-simple-experiment – Moritz Jun 10 '15 at 09:26
  • @Moritz I see this would appear to be the issue then, can you use NumPy with PyPy too? I think unfortunately I have to use NumPy and it doesn't appear to be compatible with PyPy. I think I may have to just stick with MATLAB for this particular application – Lererferler Jun 10 '15 at 09:49
  • I do think PyPy is not widely used in the scientific community. – Moritz Jun 10 '15 at 09:57
  • One candidate is the repeated conversion from NumPy array to Python list. Try to avoid this by working with either one or the other. If this isn't an option, it should be faster to use the `ndarray.tolist` method. –  Jun 10 '15 at 10:24
  • I think my MATLAB background is probably the reason for this. I haven't yet found an easy way to do a string compare in Python. If I take away the conversion to list, then the string compare doesn't work with the np.array – Lererferler Jun 10 '15 at 10:31
  • Useful profiler commands in matlab: `profile on; profile off; profile viewer; profile clear;`. I think the use of the commands is self explanatory. Otherwise please comment and ask – patrik Jun 10 '15 at 11:45
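
The profiling suggestion from the comments can also be tried from inside Python rather than from the command line. The following is only a minimal sketch using the standard-library `cProfile` and `pstats` modules; `slow_membership` is a hypothetical stand-in for the real loop (which cannot be run here), and `profile_call` is a helper name made up for this example.

```python
import cProfile
import io
import pstats


def slow_membership(n):
    """Hypothetical stand-in for the loop body: repeated linear list scans."""
    seen = []
    for i in range(n):
        key = str(i % 50)
        if key not in seen:
            seen.append(key)
    return seen


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(slow_membership, 2000)
print(report)
```

Wrapping the suspect loop in a function and passing it to `profile_call` should show whether the time goes into list conversions, string searches, or something else entirely.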

1 Answer


Although I can't be sure what is the primary source of the slowdown, I do notice some things that will cause a slowdown, are easy to fix, and will result in cleaner code:

  1. You do a lot of conversion from NumPy arrays to lists. Type conversions are expensive, so try to avoid them whenever possible. In your case, little of what you do benefits from NumPy. You are better off just using lists in place of 1D arrays, or lists of lists in place of 2D arrays, in almost all cases. This is closer to cell arrays in MATLAB, except that lists can be dynamically resized with good performance. The only possible exceptions are sasTick, bbgTick, and prices, with the latter two working fine either way. For the others, in cases where you just add values incrementally, create empty lists and use append; in cases where you need to access an arbitrary element, pre-allocate with None or empty strings ''. For tickerList it is probably easier to have two lists.
  2. You assign a lot of integers to unicode arrays. This also involves a type conversion (integer to unicode), which wouldn't be an issue if you used lists.
  3. You use foo.item(l) a lot. This converts a NumPy element to an ordinary Python data type. Again, this is a type conversion, so don't do it if you can possibly avoid it. If you follow suggestion 1 and use lists, you never need to do this in the current code.
  4. You have continue statements in the MATLAB version but not in the Python version, which means the Python version does computation that the MATLAB version skips. I think you are better off with if...elif, but continue also works in Python.
  5. You loop over range(0,len(ticker)), and then extract that element of ticker multiple times. You are better off just looping over ticker directly, for example with for i, iticker in enumerate(ticker):. Using enumerate allows you to keep track of the index as well.
  6. You use find to determine whether a substring is in a given string. It is faster, clearer, and simpler to just use in for that. Only use find if you care exactly where the substring is found, which you don't.
  7. For both modelListnum and tickerListnum, you add one, assign the value to an array element, then add one again and assign it back to the counter, doing the same addition twice. In the MATLAB version, you increment first, then assign the already incremented value. So the Python version does this arithmetic twice as often as the MATLAB version.
  8. It is quicker to pre-allocate tickerType to 'Future' like you do in MATLAB, which you can do by using something like tickerType = ['Future']*len(ticker).
  9. Since tickerListnum and modelListnum are always equal to the index, there is no reason to have them at all. Just get rid of them.
  10. Since there is only ever one instance of each longTicker value in tickerList, it will be faster and easier to use an OrderedDict, or a regular dict if you don't care about order, where the key is the longTicker value and the value is the type number.
  11. If you don't care about the order of modelList, using a set will be faster.
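
To get a feel for points 1, 10, and 11, here is a small, hedged comparison of the two membership-testing styles. It is a sketch on made-up data, not a benchmark of the real script: `codes` is a hypothetical stand-in for modCode, and both function names are invented for this example. The first function mimics the question's pattern of converting a NumPy array to a list on every iteration; the second uses a dict, which tests membership by hash and keeps first-seen order.

```python
import timeit

import numpy as np

# Hypothetical data standing in for modCode: 1000 codes, 50 distinct values.
codes = np.array([f"M{i % 50}" for i in range(1000)], dtype="U15")


def unique_via_list_conversion(values):
    """Question-style: fixed-size array, converted to a list for every test."""
    store = np.zeros(len(values), dtype="U15")  # filled with empty strings
    count = 0
    for value in values:
        if value not in list(store):  # full conversion plus linear scan
            store[count] = value
            count += 1
    return [str(v) for v in store[:count]]


def unique_via_dict(values):
    """Answer-style: dict keys give hashed membership tests, first-seen order."""
    store = {}
    for value in values:
        store.setdefault(str(value), None)
    return list(store)


t_list = timeit.timeit(lambda: unique_via_list_conversion(codes), number=3)
t_dict = timeit.timeit(lambda: unique_via_dict(codes), number=3)
print(f"list conversion: {t_list:.4f}s, dict: {t_dict:.4f}s")
```

Both functions return the same ordered list of distinct codes, so the only thing that changes is how much work each membership test costs. The exact timings depend on the machine and on the array sizes.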

So here is a version that should be faster, assuming metalTicks is a list of lists (one inner list per column), sasTick is a NumPy array, and prices and bbgTick are either lists or arrays, and assuming you care about the order of modelList and tickerList:

from collections import OrderedDict

import numpy as np

longTicker = [None]*len(ticker)
tickerType = ['Future']*len(ticker)
tickerList = OrderedDict()
modelList = []
# One integer type code per ticker; 1 is the default
derivativeType = np.ones(len(ticker), dtype=int)

for i, (iticker, imodCode) in enumerate(zip(ticker, modCode)):
    if imodCode not in modelList:
        modelList.append(imodCode)

    if '3 MONTH' in iticker:
        # metalTicks[0] holds the tickers, metalTicks[3] the long names
        x = metalTicks[0].index(iticker)
        longTicker[i] = metalTicks[3][x]
        derivativeType[i] = 4

    elif 'CURNCY' in iticker:
        if 'KRWUSD CURNCY' in iticker:
            prices[i] = 1/prices[i]

        longTicker[i] = iticker
        derivativeType[i] = 2
        tickerType[i] = 'FX'

    elif '_' in iticker:
        # Take the first match as a scalar so it can be used as a dict key
        longTicker[i] = bbgTick[np.flatnonzero(sasTick == iticker)[0]]
        derivativeType[i] = 3
        tickerType[i] = 'Option'

    # Tickers matching no branch keep longTicker[i] = None (conversion not ported)
    tickerList[longTicker[i]] = derivativeType[i]
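
One behavioral detail may be worth checking against your data: the loop in the question records a ticker only the first time it is seen, whereas a plain assignment into a dict overwrites any earlier value for the same key. If a given long ticker can only ever map to one type code, the two behave identically. The sketch below uses made-up pairs (not real tickers) to show the difference, with `setdefault` as one way to keep first-seen values. Note also that since Python 3.7 a regular dict preserves insertion order, so OrderedDict is mainly needed on older versions.

```python
from collections import OrderedDict

# Made-up (ticker, type code) pairs; 'AAA' appears twice with different codes.
pairs = [("AAA", 4), ("BBB", 2), ("AAA", 1)]

last_wins = OrderedDict()
first_wins = OrderedDict()
for name, code in pairs:
    last_wins[name] = code             # overwrites, key keeps its position
    first_wins.setdefault(name, code)  # stores only keys not seen before

print(list(last_wins.items()))   # [('AAA', 1), ('BBB', 2)]
print(list(first_wins.items()))  # [('AAA', 4), ('BBB', 2)]
```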

If you don't care about the order of modelList and tickerList, you can do this:

import numpy as np

longTicker = [None]*len(ticker)
tickerType = ['Future']*len(ticker)
tickerList = {}
modelList = set()
# One integer type code per ticker; 1 is the default
derivativeType = np.ones(len(ticker), dtype=int)

for i, (iticker, imodCode) in enumerate(zip(ticker, modCode)):
    # A set ignores duplicates, so no membership test is needed
    modelList.add(imodCode)

    if '3 MONTH' in iticker:
        x = metalTicks[0].index(iticker)
        longTicker[i] = metalTicks[3][x]
        derivativeType[i] = 4

    elif 'CURNCY' in iticker:
        if 'KRWUSD CURNCY' in iticker:
            prices[i] = 1/prices[i]

        longTicker[i] = iticker
        derivativeType[i] = 2
        tickerType[i] = 'FX'

    elif '_' in iticker:
        # Take the first match as a scalar so it can be used as a dict key
        longTicker[i] = bbgTick[np.flatnonzero(sasTick == iticker)[0]]
        derivativeType[i] = 3
        tickerType[i] = 'Option'

    tickerList[longTicker[i]] = derivativeType[i]

Or simpler yet:

import numpy as np

longTicker = [None]*len(ticker)
tickerType = ['Future']*len(ticker)
# One integer type code per ticker; 1 is the default
derivativeType = np.ones(len(ticker), dtype=int)

for i, iticker in enumerate(ticker):
    if '3 MONTH' in iticker:
        x = metalTicks[0].index(iticker)
        longTicker[i] = metalTicks[3][x]
        derivativeType[i] = 4

    elif 'CURNCY' in iticker:
        if 'KRWUSD CURNCY' in iticker:
            prices[i] = 1/prices[i]

        longTicker[i] = iticker
        derivativeType[i] = 2
        tickerType[i] = 'FX'

    elif '_' in iticker:
        # Take the first match as a scalar so it can be used as a dict key
        longTicker[i] = bbgTick[np.flatnonzero(sasTick == iticker)[0]]
        derivativeType[i] = 3
        tickerType[i] = 'Option'

# Build both collections in one step after the loop
modelList = set(modCode)
tickerList = dict(zip(longTicker, derivativeType))
TheBlackCat
  • *mindreading on* Point 8 is because the list 'type' is already filled with `'Future'` on creation in matlab, in python he renamed it to 'tickerType' and fills it in the loop instead. `tickerType = ['Future']*len(ticker)` is therefore what is needed here. *mindreading off* Other than that, great answer! ;) – swenzel Jun 10 '15 at 12:59
  • @swenzel Thanks, fixed – TheBlackCat Jun 10 '15 at 13:03
  • This is absolutely perfect and exactly what I was hoping for. I'm very new to NumPy and python and really am not privy to what is the best way to go about things. Your code has sped things up to the point where it is faster than MATLAB, so thank you very much for that. I'll have to look up how you've actually done what you've done so that I can make my code faster in the future too – Lererferler Jun 10 '15 at 13:08
  • I have added a further simplified version that is probably faster still. – TheBlackCat Jun 10 '15 at 13:17
  • The important thing to keep in mind is that you shouldn't write MATLAB code in Python. Although it is usually possible to directly translate MATLAB code into Python, efficient, well-written MATLAB code generally does not translate into efficient, well-written Python. It is much better to figure out exactly what you want to accomplish, and then figure out how to do it in a Python way. – TheBlackCat Jun 10 '15 at 14:32
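
To make that last comment a little more concrete, here is one possible sketch of a more Python-style formulation of the classification step. It is an illustration only: the function name `classify`, the sample tickers, and the simplified rules (no metalTicks or bbgTick lookups, no price inversion) are assumptions made for this example and are not taken from the answer's code.

```python
def classify(ticker):
    """Return (type name, type code) from simple substring rules."""
    if "3 MONTH" in ticker:
        return "Future", 4
    if "CURNCY" in ticker:
        return "FX", 2
    if "_" in ticker:
        return "Option", 3
    return "Future", 1  # default, as in the pre-filled arrays


# Made-up sample tickers, one per rule.
tickers = ["XX 3 MONTH", "EURUSD CURNCY", "ABC_123", "PLAIN"]

# One pass over the data, no pre-allocation and no index bookkeeping.
ticker_type, derivative_type = zip(*(classify(t) for t in tickers))

print(ticker_type)
print(derivative_type)
```

Putting the rules in a small function keeps the loop body out of the way and makes each rule easy to test on its own, which is often a more natural starting point in Python than pre-allocating arrays and filling them by index.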