1

I have a text preprocessing function like this:

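# Imports assumed from the names used below (NLTK and the standard library).
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocessing(text):
    # Lowercase, then strip punctuation characters.
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    # Tokenize, drop English stopwords, and stem what remains.
    words = word_tokenize(text)
    words = [word for word in words if word not in stopwords.words('english')]
    words = [PorterStemmer().stem(word) for word in words]

    return words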

I apply it to a DataFrame column like this:

df['reviewText'] = df['reviewText'].apply(lambda x: preprocessing(x))

But the DataFrame column has around 10,000 review sentences, and the code takes too long to complete. Is there any way to add a progress bar so that I can get a sense of how long it will take?

PS. If you want to try this on your local machine, the data can be found on this site.

Sticky
  • Does this help https://stackoverflow.com/questions/43259717/progress-bar-for-a-for-loop-in-python-script ? – Greg Oct 11 '21 at 10:25
  • @Greg I don't think so; my function does not have any for loop, and I don't know how to implement that answer. I know it is a lot to ask on SO, but can you show me (in the answer section) how to apply `tqdm` to my preprocessing function? – Sticky Oct 11 '21 at 10:31
  • You have 3 `for` loops; they've just been shortened to one-liners. – Greg Oct 11 '21 at 10:49

2 Answers

3

Import tqdm, call tqdm.pandas() to register the progress-aware methods on pandas objects, and replace .apply() with .progress_apply():

from tqdm.auto import tqdm
tqdm.pandas()

df['reviewText'] = df['reviewText'].progress_apply(lambda x: preprocessing(x))
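To try this without the review data, here is a minimal sketch. The three-row DataFrame, the `tokens` column and the `toy_preprocessing` function are made-up stand-ins for illustration; only `tqdm.pandas()` and `progress_apply` are the technique itself.

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_apply (and related methods) on pandas objects.
tqdm.pandas(desc="preprocessing")

def toy_preprocessing(text):
    # Hypothetical stand-in for the real preprocessing(): lowercase and split.
    return text.lower().split()

df = pd.DataFrame({"reviewText": ["Great product", "Would NOT buy again", "Okay"]})

# Same call shape as .apply(), but a bar is drawn while rows are processed.
df["tokens"] = df["reviewText"].progress_apply(toy_preprocessing)
```

The function can be passed directly; wrapping it in a lambda is not needed.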
Nils Werner
1

If you want a progress bar you need a loop: a progress bar reports how far an iteration has got. Fortunately you already have one here, inside apply. As a quick solution that keeps apply, I would have the function update the progress bar as a side effect:

from tqdm import tqdm
t = tqdm(total=len(df.index))

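def fn(x, state=[0]):
    result = preprocessing(x)
    state[0] += 1  # running count of processed rows
    t.update(1)  # update() takes an increment, not a running total
    return result  # apply() needs the value to build the new column
df['reviewText'] = df['reviewText'].apply(fn)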

t.close()

Whether this is clearer than writing the loop out explicitly is your call; I'm not sure it is.

(What's with the state=[0]? We're defining a mutable default argument, which is allocated once, when fn is defined, and then using it to keep track of state, since we have to manage state manually with this approach.)
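The mutable-default behaviour can be seen in isolation with a small sketch. `count_calls` is a hypothetical function written only to show that the default list survives between calls.

```python
def count_calls(state=[0]):
    # The default list is created once, when the function is defined,
    # so every call that omits the argument sees the same list.
    state[0] += 1
    return state[0]

first = count_calls()
second = count_calls()
third = count_calls()
print(first, second, third)  # prints: 1 2 3
```

A tqdm bar also keeps its own count (the `n` attribute), so a separate counter like this is only needed if you want the number for your own use.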

Explicit loop

applied = []
for row in tqdm(df["reviewText"]):
    applied.append(preprocessing(row))

df["reviewText"] = applied
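As a quick check that the explicit loop gives the same column as apply, here is a small self-contained sketch. The DataFrame contents and `toy_preprocessing` are made-up stand-ins, not the real review data or the real preprocessing function.

```python
import pandas as pd
from tqdm import tqdm

def toy_preprocessing(text):
    # Hypothetical stand-in for the real preprocessing().
    return text.lower().split()

df = pd.DataFrame({"reviewText": ["Great product", "Would NOT buy again", "Okay"]})

# Reference result from plain .apply(), without a progress bar.
expected = df["reviewText"].apply(toy_preprocessing).tolist()

# tqdm wraps the Series; its length gives the bar a total to count towards.
applied = []
for row in tqdm(df["reviewText"], desc="preprocessing"):
    applied.append(toy_preprocessing(row))

df["reviewText"] = applied
```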
2e0byo