1

I am have two dataframes,

df1 = pd.DataFrame({'a': [1.5, 2.5], 'b': [0.25, 2.75], 'c': [1.25, 0.75], 'd': [1.5, 2.5],'e': [0.25, 2.75], 'f': [1.25, 0.75]})

df2 = pd.DataFrame({'a': [1.5, 2.5,3.5,4.5], 'b': [0.25, 1.5, 2.5, 2.75], 'c': [1.25, 0.75, 3.5, 4.5], 'd': [1.5, 2.5, 3.5, 4.5],'e': [0.25, 2.75, 1.5, 3.5], 'f': [1.25, 0.75, 2.5, 4.5]})

For every row in df1, I want to find the distance of that row with all the rows of df2 for specific columns. After finding the distance, I want to find the minimum distance for that individual row among all and return the corresponding 'e' value of df2.

For eg, If I pass a and b columns, For each row of df1, I want to find distance between a and b for all the rows in df2 and find the minimum distance of all and get the corresponding 'e' value of df2.

I am using the following two functions,

def distance(x1, x2, L):
    start_time = time.time()
    dist = (np.sum((np.array(x1)-np.array(x2))**L))**(1/(float(L)))
    print("Time taken: " + str(round(time.time() - start_time,2)) + " seconds")
    return dist

def mindistance(data1,data2,variables,L):
    start_time = time.time()
    pred_values=[]
    test1=[]
    for index2, row2 in data2.iterrows():
        test=[]
        for index1, row1 in data1.iterrows():
            a=distance(row2[variables],row1[variables],L)
            test.append(a)
        #print(test)
        index=test.index(min(test))
        #print(index)
        b=round(data1['e'].iloc[index],2)
        pred_values.append(b)

    print(pred_values)
    print(len(pred_values))
    return "Time taken: " + str(round(time.time() - start_time,2)) + " seconds"

print mindistance(df2, df1,['a','b'],2)

This functions are working fine. But there is a huge efficiency problem in this code. The distance part is taking a long time. Suppose if I have around 60000 iterations on the whole to be done based on my original dataframe it is taking more than one minute to compute it. I have tried line by line debugging and most of the time is taken in the a=distance(row2[variables],row1[variables],L) line. Can anybody help me in making the code efficient?

Paul H
  • 65,268
  • 20
  • 159
  • 136
haimen
  • 1,985
  • 7
  • 30
  • 53
  • have you profiled your code? what happens if you use `apply` instead of explicitly iterating through the dataframe? – Paul H Feb 04 '16 at 19:05
  • @PaulH I am sorry I am new to python. Not sure what is profiling and dont know how to use apply function? Can you help me in doing that? – haimen Feb 04 '16 at 19:06
  • profiling python code: http://stackoverflow.com/questions/3927628/how-can-i-profile-python-code-line-by-line – Paul H Feb 04 '16 at 19:06
  • using `apply` http://pandas.pydata.org/pandas-docs/version/0.17.1/basics.html#row-or-column-wise-function-application – Paul H Feb 04 '16 at 19:08
  • 1
    It's because your code is inherently long: for each row, you have to sample as other rows (`O(n)` time complexity, which means the running time for each row is proportional to the size of the dataframe), meaning your total running time is `O(n^2)`. Since row1 checking row2 == row2 checking row1, you can remove redundancy and get `O(n*log(n))` time. Using vectorized code, such as `apply`, will also help a lot. But remember: your code is profiling a gigantic amount of data. Minimizing the search space will be the predominant factor in shortening runtimes. – Alex Huszagh Feb 04 '16 at 19:15
  • @AlexanderHuszagh Thanks a lot for the excellent explanation. I can understand the problem now. But I am not sure how to use apply instead of iterators. Is it possible for you to edit the code? If I am asking too much, can you give a single statement of how to go about ? – haimen Feb 04 '16 at 19:19
  • @PaulH Thanks for the links. In that apply related link, I could not find any details of how to use apply instead of loops. Since I am new , I am not able to figure that out. Can you help me a bit in editing the code? – haimen Feb 04 '16 at 19:20

1 Answers1

3

Here is how you can rewrite your mindistance function using apply.

def mindistance(data1,data2,variables,L):
    start_time = time.time()
    pred_values=[]
    test1=[]
    for index2, row2 in data2[variables].iterrows():
        test=list( data1[variables].apply( distance, args=(row2,L,), axis=1 ) )
        index=test.index(min(test))
        #print index
        b=round(data1['e'].iloc[index],2)
        pred_values.append(b)

    #print pred_values
    #print len(pred_values)
    return "Time taken: " + str(round(time.time() - start_time,2)) + " seconds"

This will bring substantial improvement. And you can go further by turning the other loop into a call to apply too.

innoSPG
  • 4,588
  • 1
  • 29
  • 42