1

I have a column in pandas in which each element is a list of strings. The string elements inside the list are float. I need to select only the top three and the bottom three floats for each list.

for index,rows in tqdm(data.iterrows()):
    s=rows['prob_tokens'].split(' ')
    x=[float(elem) for elem in s]
    x.sort()

    high_sum=0
    low_sum=0

    try:
        low_sum = math.log(x[0])+math.log(x[1])+math.log(x[2])
    except:
        low_sum=-10000000
    try:
        high_sum= math.log(x[-3])+math.log(x[-1])+math.log(x[-2])
    except:
        high_sum=-10000000
    data.loc[index,'high_sum']=high_sum
    data.loc[index,'low_sum']=low_sum

This is very inefficient and takes a lot of time for processing a file of 1M rows. Is there a faster way of doing this?

prob_tokens high_sum low_sum
0.028424 0.000922 0.037654 0.563366 0.99988 0.916362 0.356194 -0.29 -5.037

I found a solution for my problem and the code that I used is below.

def ffrow(data):
    s=data.split(" ")
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    try:
        low_sum = math.log(x[0])+math.log(x[1])+math.log(x[2])
    except:
        low_sum = -10000000
    try:
        high_sum = math.log(x[-3])+math.log(x[-1])+math.log(x[-2])
    except:
        high_sum = -10000000
    return high_sum,low_sum

def fastapply(df):
    hs=[]
    ls=[]

    for i in range(0,len(df)):
        h,l=ffrow(df.iloc[i]['prob_tokens'])
        hs.append(h)
        ls.append(l)
    return hs,ls

hs=fastapply(dataxx)
dataxx['high_sum']=hs[0]
dataxx['low_sum']=hs[1]

df.apply() does not take advantage of vectorization hence, I used a loop on the index of the dataframe so that rows can be processed in a parallel way. Note that this is different from simply iterating over rows. I used timeit for calculating the performance of my code as compared to df.apply() on a pandas dataframe with 100k rows. df.apply took 103.8 seconds whereas looping on the index completed in 12.2 seconds. Cheers!

2 Answers2

1

Using loops on a pandas dataframe is very slow, you should look for other methods wherever possible. See this question to learn more.

For this problem, create two functions for high_sum and low_sum then use df.apply to apply the data to the columns:

import math
import pandas as pd

def high_sum(x):
    try:
        return sum([math.log(i) for i in sorted(x)[-3:]])
    except:
        return -10000000
    
def low_sum(x):
    try:
        return sum([math.log(i) for i in sorted(x)[:3]])
    except:
        return -10000000

temp = ['0.028424', '0.000922', '0.037654', '0.563366', '0.99988', '0.916362', '0.356194']
df = pd.DataFrame([[temp]]*int(1e6))
df["high_sum"] = df[0].apply(lambda x: high_sum([float(i) for i in x]))
df["low_sum"] = df[0].apply(lambda x: low_sum([float(i) for i in x]))

This takes ~1s to run on my computer

Tom McLean
  • 5,583
  • 1
  • 11
  • 36
1
import numpy as np

def by_row(row):
    s = row["prob_tokens"].split(" ")
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    try:
        low_sum = np.log(x[:3]).sum()
    except:
        low_sum = -10000000
    try:
        high_sum = np.log(x[:-3]).sum()
    except:
        high_sum = -10000000
    row["low_sum"] = low_sum
    row["high_sum"] = high_sum
    return row

df["high_sum", "low_sum"] = np.NaN
df = df.apply(by_row, axis=1)

you just need to apply once :D

Frederick Zhang
  • 160
  • 1
  • 8
  • Thanks for your answer. Everything worked fine. I would like to suggest an edit though. x[:-3] selects all the elements of a list except the last three whereas I wanted to get only the last three so I changed it to x[-3:]. Another point I want to mention (albeit a bit irrelevant for normal users) is that np.log(0) does not throw an exception whereas math.log(0) does. Since, I wanted to use a fixed value of -1000000 if I encounter a 0, I changed np.log to math.log. – shubh gupta Jun 20 '21 at 20:42