I have a code with the following while loop:
def get_names(key):
temp = pd.DataFrame(columns=['name','tot_count'])
df2 = dfx.loc[(dfx['name1'] == key) | (dfx['name2'] == key) | (dfx['name3'] == key) | (dfx['name4'] == key)]
for c in cl:
y = df2[[c,'tot_count']]
y.columns = ['name','tot_count']
temp = pd.concat([temp,y])
temp['key'] = key
return(temp)
cl = ['name1', 'name2', 'name3', 'name4']
n1 = len(s1) #s1 is the list of keywords
i = 0
d = pd.DataFrame(columns=['name','key'])
while(i < n1):
key = s1[i]
y = get_names(key)
d = pd.concat([d,y])
dfx = dfx.loc[(dfx['name1'] != key) & (dfx['name2'] != key) & (dfx['name3'] != key) & (dfx['name4'] != key)]
print(i)
i = i + 1
Here I have a thousands of keywords in the list s1
. The while loop processes each keyword in s1
as defined in the function get_names
where it uses the dataframe dfx
which has millions of rows and contains rows relevant to the each keywords. After a keyword has been processed, I no longer need those rows corresponding to this keyword in the main dataframe dfx
to be re-processed for another keyword. So I delete it inside the while loop.
Question: The run time of this code is about 2.5 hours. How can I improve the code the make it run faster on the same hardware?