I have a pandas column in which each element is a string of space-separated floats. For each row, I need to pick the three largest and the three smallest floats and sum their logarithms.
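import math
from tqdm import tqdm

for index, rows in tqdm(data.iterrows()):
    s = rows['prob_tokens'].split(' ')
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    try:
        # Sum of logs of the three smallest values
        low_sum = math.log(x[0]) + math.log(x[1]) + math.log(x[2])
    except (IndexError, ValueError):
        # Fewer than three values, or a value <= 0
        low_sum = -10000000
    try:
        # Sum of logs of the three largest values
        high_sum = math.log(x[-3]) + math.log(x[-1]) + math.log(x[-2])
    except (IndexError, ValueError):
        high_sum = -10000000
    # Writing to the dataframe on every iteration is slow
    data.loc[index, 'high_sum'] = high_sum
    data.loc[index, 'low_sum'] = low_sum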
This is very inefficient and takes a long time on a file with 1M rows. Is there a faster way of doing this?
prob_tokens | high_sum | low_sum |
---|---|---|
0.028424 0.000922 0.037654 0.563366 0.99988 0.916362 0.356194 | -0.29 | -5.037 |
I found a solution for my problem and the code that I used is below.
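import math

def ffrow(data):
    # Parse one space-separated string into a sorted list of floats
    s = data.split(" ")
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    try:
        # Sum of logs of the three smallest values
        low_sum = math.log(x[0]) + math.log(x[1]) + math.log(x[2])
    except (IndexError, ValueError):
        # Fewer than three values, or a value <= 0
        low_sum = -10000000
    try:
        # Sum of logs of the three largest values
        high_sum = math.log(x[-3]) + math.log(x[-1]) + math.log(x[-2])
    except (IndexError, ValueError):
        high_sum = -10000000
    return high_sum, low_sum

def fastapply(df):
    # Collect results in plain lists instead of writing to the dataframe per row
    hs = []
    ls = []
    for i in range(len(df)):
        h, l = ffrow(df.iloc[i]['prob_tokens'])
        hs.append(h)
        ls.append(l)
    return hs, ls

high_sums, low_sums = fastapply(dataxx)
dataxx['high_sum'] = high_sums
dataxx['low_sum'] = low_sums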
df.apply() does not take advantage of vectorization, so instead I looped over the positional index of the dataframe, collected the results in plain Python lists, and assigned each column once at the end. Note that this loop still processes rows one at a time; it is neither parallel nor vectorized, but it avoids writing to the dataframe with data.loc on every iteration. I used timeit to compare my code against df.apply() on a pandas dataframe with 100k rows: df.apply() took 103.8 seconds, whereas looping on the index completed in 12.2 seconds. Cheers!
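As a further sketch, assuming the same prob_tokens format (one string of space-separated floats per row), the per-row df.iloc lookup could probably be avoided as well by iterating over the column's values directly. The names SENTINEL, log_sum, high_low_sums and column_log_sums are hypothetical helpers introduced here for illustration, and I have not timed this variant against the numbers above.

```python
import math

# Same fallback value as in the code above
SENTINEL = -10000000

def log_sum(values):
    """Sum of natural logs of exactly three positive values, else SENTINEL."""
    # Mirrors the try/except fallback: too few values, or a value <= 0
    if len(values) < 3 or min(values) <= 0:
        return SENTINEL
    return sum(math.log(v) for v in values)

def high_low_sums(prob_tokens):
    """Return (high_sum, low_sum) for one space-separated string of floats."""
    # split() with no argument also tolerates repeated spaces
    x = sorted(float(tok) for tok in prob_tokens.split())
    return log_sum(x[-3:]), log_sum(x[:3])

def column_log_sums(strings):
    """Process an iterable of strings; return two lists (high sums, low sums)."""
    high_sums = []
    low_sums = []
    for s in strings:
        h, l = high_low_sums(s)
        high_sums.append(h)
        low_sums.append(l)
    return high_sums, low_sums
```

With the dataframe from the answer above, this might be called as column_log_sums(dataxx['prob_tokens'].tolist()), with the two returned lists then assigned to the high_sum and low_sum columns in one step each.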