
I have code with the following while loop:

import pandas as pd

def get_names(key):
    temp = pd.DataFrame(columns=['name', 'tot_count'])
    df2  = dfx.loc[(dfx['name1'] == key) | (dfx['name2'] == key) | (dfx['name3'] == key) | (dfx['name4'] == key)]
    for c in cl:
        y         = df2[[c, 'tot_count']]
        y.columns = ['name', 'tot_count']
        temp      = pd.concat([temp, y])
    temp['key'] = key
    return temp

cl = ['name1', 'name2', 'name3', 'name4']
n1 = len(s1) #s1 is the list of keywords
i  = 0
d  = pd.DataFrame(columns=['name','key'])

while i < n1:
    key = s1[i]
    y = get_names(key)
    d = pd.concat([d,y])
    dfx = dfx.loc[(dfx['name1'] != key) & (dfx['name2'] != key) & (dfx['name3'] != key) & (dfx['name4'] != key)]
    print(i)
    i = i + 1

Here I have thousands of keywords in the list s1. The while loop processes each keyword in s1 via the function get_names, which uses the dataframe dfx; dfx has millions of rows and contains the rows relevant to each keyword. After a keyword has been processed, I no longer need its rows in the main dataframe dfx to be re-processed for another keyword, so I delete them inside the while loop.

Question: The run time of this code is about 2.5 hours. How can I improve the code to make it run faster on the same hardware?

Stacker

1 Answer


d = pd.concat([d, y]) should almost never be used in a loop. It results in quadratic execution, because a new, ever-growing dataframe is created at every iteration. You should collect the pieces in a list instead and concatenate them once at the end (see this post). The same applies to temp = pd.concat([temp, y]).
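For example, a minimal sketch of that pattern, reusing the question's get_names and keeping the in-loop filtering of dfx (which the next point addresses):

parts = []                                 # plain Python list; appending is cheap
for key in s1:
    parts.append(get_names(key))           # get_names as defined in the question
    dfx = dfx.loc[(dfx['name1'] != key) & (dfx['name2'] != key) &
                  (dfx['name3'] != key) & (dfx['name4'] != key)]
d = pd.concat(parts, ignore_index=True)    # one concat at the end, not one per iteration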

Additionally, repeatedly comparing the whole dataframe just to find a key is clearly inefficient: it is, again, quadratic execution, and quadratic execution is a dead end for big-data computation. This can be solved by pre-computing dataframe parts with groupby and storing them in a dictionary whose keys are the searched keys. With that, any fetch runs in constant time instead of linear time. It is a bit more complex with 4 columns, but the idea is the same: pre-compute by operating on groups instead of on separate items or rows.
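A sketch of that idea, assuming the relevant columns of dfx are name1..name4 and tot_count as in the question (the helper names long_df and key_to_rows are just for illustration). The 4 name columns are melted into a single one so that one groupby covers them all, and the "delete rows once their first key is processed" behaviour is emulated by tracking which original rows were already consumed; the row order of the output may differ from the original loop:

import pandas as pd

cl = ['name1', 'name2', 'name3', 'name4']

# Long format: 4 rows per original row (one per name column), original index kept.
long_df = dfx.melt(id_vars=['tot_count'], value_vars=cl,
                   value_name='name', ignore_index=False)

# Pre-compute once: key -> indices of the original rows containing that key.
key_to_rows = {k: g.index.unique() for k, g in long_df.groupby('name')}

parts = []
consumed = pd.Index([])                    # original rows claimed by an earlier key
for key in s1:
    rows = key_to_rows.get(key)
    if rows is None:
        continue
    rows = rows.difference(consumed)       # emulate the in-loop deletion from dfx
    consumed = consumed.union(rows)
    part = long_df.loc[rows, ['name', 'tot_count']].copy()  # all 4 name values per matched row
    part['key'] = key
    parts.append(part)

d = pd.concat(parts, ignore_index=True)    # again, a single concat at the end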

Jérôme Richard