3

I have 2 for loops, one after the other and I want to get rid of them somehow to improve code speed. My dataframe from pandas looks like this (headers represent different companies and the rows represent different users and 1 means that the user accessed that company, 0 otherwise):

   100  200  300  400
0    1    1    0    1
1    1    1    1    0

I want to compare each pair of companies in my dataset and for that, I created a list that contains all the ids of the companies. The code looks at the list takes the first company (base), then it pairs with every other company (peer), hence the second "for" loop. My code is the following:

def calculate_scores():
    df_matrix = create_the_matrix(df)
    print(df_matrix)
    for base in list_of_companies:
        counter = 0
        for peer in list_of_companies:
            counter += 1
            if base == peer:
                "do nothing"
            else:
                # Calculate first the denominator since we slice the big matrix
            # In dataframes that only have accessed the base firm
            denominator_df = df_matrix.loc[(df_matrix[base] == 1)]
            denominator = denominator_df.sum(axis=1).values.tolist()
            denominator = sum(denominator) - len(denominator)

            # Calculate the numerator. This is done later because
            # We slice up more the dataframe above by
            # Filtering records which have been accessed by both the base and the peer firm
            numerator_df = denominator_df.loc[(denominator_df[base] == 1) & (denominator_df[peer] == 1)]
            numerator = len(numerator_df.index)
            annual_search_fraction = numerator/denominator
            print("Base: {} and Peer: {} ==> {}".format(base, peer, annual_search_fraction))

EDIT 1 (added code explanation):

The metric is the following:

enter image description here

1) The metric that I am trying to calculate is going to tell me how many times 2 companies are searched together in comparison with all the other searches.

2) The code is first selecting all the users which have accessed the base firm (denominator_df = df_matrix.loc[(df_matrix[base] == 1)])line. Then it calculates the denominator which counts how many unique combinations between the base firm and any other searched firm by the user are there and since I can count the number of firms accessed (by the user), I can subtract 1 to get the number of unique links between the base firm and the other firms.

3) Next, the code filters the previous denominator_df to select only the rows which accessed the base and the peer firm. Since I need to count the number of users which accessed the base and the peer firm, I use the command: numerator = len(numerator_df.index) to count the number of rows and that will give me the numerator.

The expected output from the dataframe at the top is the following:

Base: 100 and Peer: 200 ==> 0.5
Base: 100 and Peer: 300 ==> 0.25
Base: 100 and Peer: 400 ==> 0.25
Base: 200 and Peer: 100 ==> 0.5
Base: 200 and Peer: 300 ==> 0.25
Base: 200 and Peer: 400 ==> 0.25
Base: 300 and Peer: 100 ==> 0.5
Base: 300 and Peer: 200 ==> 0.5
Base: 300 and Peer: 400 ==> 0.0
Base: 400 and Peer: 100 ==> 0.5
Base: 400 and Peer: 200 ==> 0.5
Base: 400 and Peer: 300 ==> 0.0

4) The sanity check to see if the code gives the correct solution: all the metrics between 1 base firm and all the other peer firms have to sum up to 1. And they do in the code I posted

Any suggestions or tips on which direction to go will be appreciated!

Adrian
  • 774
  • 7
  • 26
  • 1
    Please explain the logic behind instead and expected output – yatu Jan 25 '19 at 15:24
  • 1
    Sounds like you need to take the cartesian product of your `DataFrame` [This post](https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas/53699013#53699013) explains how to do that. If performance isn't a concern, I favor the simple syntax of assigning a dummy key and just merging that way. – ALollz Jan 25 '19 at 15:26
  • 1
    There are probably a couple ways to remove the nest for loop, but based on what you want to do with it i dont think you can improve the time complexity of it. If i'm understanding it correctly you are permanently stuck in an On^2 time complexity. My best suggestion is to try moving as much as you can outside of the for loops so you have less operations in each pass through. – Sam Jan 25 '19 at 15:29
  • 1
    With respect to your goal of removing the nested loop, it would be helpful to know why this is the desired goal, which could then be used to design the answer. If your goal is performance, you may consider that you are performing redundant logic on "base" (e.g. to calculate denominator). If you move this outside of the second for loop, you will see some time savings, since you are performing that same calculating |len(list_of_companies)| times. – drw Jan 25 '19 at 16:02
  • > Yatu, I have included more details about the code > ALollz, I will look into the cartesian product. Thanks for the suggestion. Unfortunately, performance is the issue here. > Sam, I will defintly try to move more stuff out of the main loop – Adrian Jan 25 '19 at 16:17
  • @DavidR.Winer I have move the part that calculates the denominator after the first for loop and achieved quite a increase in speed! Thank you! – Adrian Jan 25 '19 at 16:18

1 Answers1

4

You might be looking for itertools.product(). Here is an example that is similar to what you seem to want to do:

import itertools

a = [ 'one', 'two', 'three' ]

for b in itertools.product( a, a ):
    print( b )

The output from the above code snippet is:

('one', 'one')
('one', 'two')
('one', 'three')
('two', 'one')
('two', 'two')
('two', 'three')
('three', 'one')
('three', 'two')
('three', 'three')

Or you could do this:

for u,v in itertools.product( a, a ):
    print( "%s %s"%(u, v) )

The output is then,

one one
one two
one three
two one
two two
two three
three one
three two
three three

If you would like a list, you could do this:

alist = list( itertools.product( a, a ) ) )

print( alist )

And the output is,

[('one', 'one'), ('one', 'two'), ('one', 'three'), ('two', 'one'), ('two', 'two'), ('two', 'three'), ('three', 'one'), ('three', 'two'), ('three', 'three')]
DrM
  • 2,404
  • 15
  • 29