
I have the loop code below and 60,000 rows of data, which is why my code is so slow. How can I make it faster and more efficient?

My code:

import tensorflow as tf
from itertools import product

rst = []
for i, j in product(df2["Description"].to_list(), df1["Question"].to_list()):
    inputs = tokenizer([i], [j], return_tensors="np", max_length=512)
    outputs = model(inputs)
    start_position = tf.argmax(outputs.start_logits, axis=1)
    end_position = tf.argmax(outputs.end_logits, axis=1)
    answer = inputs["input_ids"][0, int(start_position) : int(end_position) + 1]
    rst.append(tokenizer.decode(answer))
Omicron
gülsümmm
    What are you _doing_ with the loop? After all, your `rst` won't have any indication of what `i` and `j` were..? What is `tokenizer`, for one? And `model`? – AKX Sep 02 '22 at 13:18
  • You can use `tqdm` with your loop. This won't speed up your loop but will help you to see the progress bar and remaining time for loop to run. `from tqdm.notebook import tqdm_notebook` and `for i, j in tqdm_notebook(product(df2["Description"].to_list(), df1["Question"].to_list()), desc = 'Progress')` – Huzaifa Arshad Sep 02 '22 at 13:19
  • 1
    Where is the time being consumed? Is it in *product()* or the core part of the loop? If it's in the main body of the loop then consider multiprocessing for better performance – DarkKnight Sep 02 '22 at 13:43

1 Answer


I looked into itertools.product()'s runtime. product() itself is a lazy iterator, so building it is cheap, but consuming it yields n*m pairs, which means your loop body runs O(n*m) times. In the worst case, when the two lists have the same length, that ends up being an inefficient O(n^2).
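A quick check (a minimal sketch, independent of the question's tokenizer and model) shows that building the iterator is nearly instant even for two 60,000-item lists; the cost only appears as the pairs are consumed:

```python
from itertools import product
import time

a = list(range(60000))
b = list(range(60000))

t0 = time.perf_counter()
pairs = product(a, b)  # building the iterator materializes no pairs
create_time = time.perf_counter() - t0

first = next(pairs)  # pairs are generated one at a time, on demand
print(create_time < 1.0, first)
```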

Assuming you're using something like Jupyter notebook, maybe you can create the product first in a separate cell above the loop? That may help speed up the loop's cell.

So something like this:

prd = product(df2["Description"].to_list(), df1["Question"].to_list())

followed by the next cell:

for i, j in prd:
    ...

Edit: Some more reasoning. I think it's the sheer number of pairs coming out of product() that is slowing your code down, and that's what you should focus on. If your 'Description' and 'Question' columns each have 60,000 rows and you apply product() to both of them as lists, it yields 60,000 * 60,000 = 3,600,000,000 pairs.

3.6 billion items is a massive list for current-day computers to handle. If we try imitating this with the code below:

from itertools import product

a = [0] * 60000
b = [1] * 60000
c = product(a, b)

# list(c) materializes all 3.6 billion pairs at once
print(len(list(c)))

sure enough, my computer starts to struggle for memory space.
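Incidentally, the pair count itself doesn't require materializing anything; it's just the product of the two lengths:

```python
a = [0] * 60000
b = [1] * 60000

# No need to build the list of pairs just to count them
n_pairs = len(a) * len(b)
print(n_pairs)  # 3600000000
```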

However, I think we can both agree my original solution isn't satisfactory, so I found this answer that better explains the difficulty of handling a list with several billion items and proposes a solution involving emulated lists. If you really need a list that big, I suggest looking into that or figuring out how to do it concurrently.
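If the full pair set really is needed, one way to keep memory flat is to consume the product() iterator in fixed-size batches via itertools.islice. This is only a sketch: the descriptions/questions below are small placeholders, and the tokenizer/model call is indicated only in a comment, since batched inference depends on the actual API in use:

```python
from itertools import product, islice

def batched(iterable, batch_size):
    """Yield successive lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Small placeholder columns standing in for df2["Description"] / df1["Question"]
descriptions = ["d1", "d2", "d3"]
questions = ["q1", "q2"]

batch_sizes = []
for batch in batched(product(descriptions, questions), 4):
    firsts = [d for d, _ in batch]
    seconds = [q for _, q in batch]
    # In the real code, the whole batch would go through the tokenizer/model
    # in one call (e.g. tokenizer(firsts, seconds, ...)) instead of one pair
    # per call, which is usually where the time actually goes.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 2]
```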

  • I do not get how this should affect the runtime at all since it only puts the iterator returned from `product` in a separate variable. Do I miss something? – LostAvatar Sep 02 '22 at 14:03
  • Assuming they're using a notebook, they can declare and assign the product in a separate cell from the loop. This may not reduce the time of the entire notebook, but when they run the cell with the loop, it doesn't need to perform product(), thereby saving time in that specific cell. The product() cell only needs to be run once beforehand. – Hamit Yuksel Sep 02 '22 at 14:10
  • Valid point. I'm still thinking that the Jupyter notebook assumption is rather arbitrary as not stated anywhere in the original question. But then, the opposite is not stated as well, so no reason to complain :D – LostAvatar Sep 02 '22 at 14:29
  • Yes, it does hinge on whether or not they're using a notebook. It doesn't have to be Jupyter's, but I feel it's common for notebooks to be used when working with TensorFlow, which is why I made that assumption :) – Hamit Yuksel Sep 02 '22 at 14:34
  • Your code has the same performance as my code. I tried to use the .apply() function somewhere in my code to get a better result, but I couldn't do it. @HamitYuksel – gülsümmm Sep 02 '22 at 20:23
  • @gülsümmm I added a better explanation of my answer as an edit. I hope this helps! – Hamit Yuksel Sep 08 '22 at 22:13