How do I add a progress bar in this function?

Question

I have a text preprocessing function like this:

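import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocessing(text):
    # Lowercase, strip punctuation, tokenize, drop stopwords, then stem.
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    words = word_tokenize(text)
    words = [word for word in words if word not in stopwords.words('english')]
    words = [PorterStemmer().stem(word) for word in words]

    return words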

And I am going to pass a dataframe in this function like this:

df['reviewText'] = df['reviewText'].apply(lambda x: preprocessing(x))

But the dataframe column has around 10000 review sentences, and the code is taking too long to complete. Is there any way to add a progress bar so that I have some idea of how long it will take?

PS. If you want to try this on your local machine, the data can be found on this site.

Does this help? https://stackoverflow.com/questions/43259717/progress-bar-for-a-for-loop-in-python-script – Greg, Oct 11 '21 at 10:25
@Greg I don't think so. My function does not have any for loop and I don't know how to implement that answer. I know it is too much to ask on SO, but can you tell me (in the answer section) how to apply `tqdm` to my preprocessing function? – Sticky, Oct 11 '21 at 10:31

score 3 · Accepted Answer · answered Oct 11 '21 at 10:58

Import tqdm, call tqdm.pandas() once to register the progress_apply method on pandas objects, and replace .apply() with .progress_apply():

from tqdm.auto import tqdm
tqdm.pandas()

df['reviewText'] = df['reviewText'].progress_apply(lambda x: preprocessing(x))
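To try this without downloading the NLTK data first, here is a minimal sketch on a toy dataframe. The `toy_preprocessing` function and the three sample reviews are hypothetical stand-ins, not the asker's function or data; the `desc` keyword is passed through to the bar as its label.

```python
import pandas as pd
from tqdm.auto import tqdm

# Register progress_apply on pandas objects; desc labels the bar.
tqdm.pandas(desc="preprocessing")

def toy_preprocessing(text):
    # Hypothetical stand-in for preprocessing(): lowercase and split on whitespace.
    return text.lower().split()

# Hypothetical sample data in place of the real reviews.
df = pd.DataFrame(
    {"reviewText": ["Great product", "Would NOT buy again", "Okay for the price"]}
)

# Same call shape as .apply(), but a bar is drawn while rows are processed.
df["reviewText"] = df["reviewText"].progress_apply(toy_preprocessing)
```

Swapping `toy_preprocessing` for the real `preprocessing` should be the only change needed for the original dataframe.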

Nils Werner

hah, didn't know there was a builtin... [on the dataframe] – 2e0byo Oct 11 '21 at 10:59
I don't think there is, it's most likely `tqdm.pandas()` [monkeypatching](https://en.m.wikipedia.org/wiki/Monkey_patch) it in. – Nils Werner Jan 22 '23 at 08:56
Yes, not sure what I was thinking there – 2e0byo Jan 22 '23 at 18:12
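For readers unfamiliar with the term in the comment above, here is a toy sketch of what monkeypatching a method in looks like. It is not tqdm's actual implementation: `Column` and `register_progress` are hypothetical names invented for this illustration, and the "bar" is just a counter printed to stderr.

```python
import sys


class Column:
    """Hypothetical stand-in for a pandas Series; it only has apply()."""

    def __init__(self, values):
        self.values = list(values)

    def apply(self, func):
        return Column(func(v) for v in self.values)


def register_progress():
    """Attach a progress_apply method to Column at runtime (a monkey patch)."""

    def progress_apply(self, func):
        total = len(self.values)
        done = 0

        def wrapped(value):
            nonlocal done
            result = func(value)
            done += 1
            # Report progress on stderr, then hand back the real result.
            print(f"\r{done}/{total}", end="", file=sys.stderr)
            return result

        out = self.apply(wrapped)
        print(file=sys.stderr)  # finish the progress line
        return out

    Column.progress_apply = progress_apply


col = Column(["A", "B", "C"])
assert not hasattr(col, "progress_apply")  # the method does not exist yet
register_progress()
lowered = col.progress_apply(str.lower)  # now available, even on existing instances
```

Because the method is added to the class rather than to one object, instances created before the registration call pick it up too, which is why the bar-aware method seems to be built in.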

score 1 · Answer 2 · answered Oct 11 '21 at 10:53

If you want a progress bar you have to have a loop: a progress bar tracks how far through a loop you are. Fortunately you have one here, inside the apply. As a quick solution that keeps apply, I would have the function update the progress bar as a side effect:

from tqdm import tqdm
t = tqdm(total=len(df.index))

def fn(x):
    result = preprocessing(x)
    t.update(1)  # update() takes an increment, so advance by one row
    return result  # apply() stores the return value, so it must be returned

df['reviewText'] = df['reviewText'].apply(fn)

t.close()

Whether this is clearer than writing the loop out explicitly is your call; I'm not sure it is.

(No manual counter is needed: tqdm keeps the running count itself, and update() takes the increment since the last call, not the total so far.)

Explicit loop

applied = []
for row in tqdm(df["reviewText"]):
    applied.append(preprocessing(row))

df["reviewText"] = applied
