29

In my method I have to return a list of lists. I would like to use a list comprehension for performance reasons, since the list takes about 5 minutes to create.

[[token.text for token in document] for document in doc_collection]

Is there a way to print the progress, i.e. which document is currently being processed? Something like this:

[[token.text for token in document] 
  and print(progress) for progress, document in enumerate(doc_collection)]

Thanks for your help!

rakael
  • 5
    Use a loop instead of the comprehension! – Klaus D. Jun 08 '18 at 08:04
  • @KlausD. For sure this would work, but is there no possibility to add it in the comprehension? Thanks anyway! – rakael Jun 08 '18 at 08:06
  • @KlausD. a `for` loop is way slower than `list comprehension` when creating lists – Gsk Jun 08 '18 at 08:06
  • 1
    @Chris_Rands Nice find. But I think this question is the better one (shorter and clearer, and no `pandas` usage), so it might be better to close the older question as a dupe of this one. We just need a "Don't do that; use a for loop instead" answer, then we're all set. – Aran-Fey Jun 08 '18 at 08:16
  • 1
    @Aran-Fey too late... But I can reopen and do this. done – Jean-François Fabre Jun 08 '18 at 08:17
  • @Aran-Fey "But I think this question is the better one " okay so why it has only one upvote (which isn't yours?) – Jean-François Fabre Jun 08 '18 at 08:18
  • @Aran-Fey I see your point, but I answered the other question with both the `print() or` and side function ideas and a bit more explanation than given in these answers, but I guess I'm biased *sigh* – Chris_Rands Jun 08 '18 at 08:19
  • 1
    @Jean-FrançoisFabre better doesn't mean good :p – Aran-Fey Jun 08 '18 at 08:19
  • @Chris_Rands you got some votes for that recently though ? :) – Jean-François Fabre Jun 08 '18 at 08:21
  • @Chris_Rands I do think that the answers in the older question are better; I just don't like the verbosity (and the `pandas` usage) of the question. This one here is easier to understand for a wider audience. Given the circumstances, I think nobody would blame you for re-posting your answer here :) – Aran-Fey Jun 08 '18 at 08:27
  • @Aran-Fey I guess the alternative would be to improve the other question by striping it back to the essential parts – Chris_Rands Jun 08 '18 at 08:55

6 Answers

49

tqdm

Using the tqdm package, a fast and versatile progress bar utility

pip install tqdm

from tqdm import tqdm

def process(token):
    return token['text']

l1 = [{'text': k} for k in range(5000)]
l2 = [process(token) for token in tqdm(l1)]

100%|███████████████████████████████████| 5000/5000 [00:00<00:00, 2326807.94it/s]

No requirement

1/ Use a side function

def report(index):
    if index % 1000 == 0:
        print(index)

def process(token, index, report=None):
    if report:
        report(index) 
    return token['text']

l1 = [{'text': k} for k in range(5000)]

l2 = [process(token, i, report) for i, token in enumerate(l1)]

2/ Use the `and` and `or` operators

def process(token):
    return token['text']

l1 = [{'text': k} for k in range(5000)]
l2 = [(i % 1000 == 0 and print(i)) or process(token) for i, token in enumerate(l1)]

3/ Use both

def process(token):
    return token['text']

def report(i):
    i % 1000 == 0 and print(i)

l1 = [{'text': k} for k in range(5000)]
l2 = [report(i) or process(token) for i, token in enumerate(l1)]

All 3 methods print:

0
1000
2000
3000
4000

How 2 works

  • `i % 1000 == 0 and print(i)`: `and` evaluates its second operand only if the first one is truthy, so `print(i)` runs only when `i % 1000 == 0`.
  • `or process(token)`: `or` returns its first operand if that is truthy; otherwise it evaluates and returns the second one. Here the left-hand side is always falsy, so `process(token)` is always the value added to the list.
    • If `i % 1000 != 0`, the left-hand side is `False` and `process(token)` is added to the list.
    • Otherwise the left-hand side is `None` (because `print` returns `None`), and the `or` likewise adds `process(token)` to the list.
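
The short-circuit behaviour described above can be checked in isolation. This is only a minimal sketch: `noisy` is a hypothetical stand-in for `print` (it is not part of the answer) that records each call and returns `None`, and the step is 2 instead of 1000 to keep the example small.

```python
calls = []

def noisy(i):
    # Hypothetical stand-in for print(i): records the call and returns None.
    calls.append(i)
    return None

# `and` evaluates its right operand only when the left one is truthy.
assert (False and noisy(1)) is False   # noisy is not called
assert (True and noisy(2)) is None     # noisy is called and returns None

# `or` returns its right operand whenever the left one is falsy,
# even if the right operand is itself falsy (0 here).
assert (False or 0) == 0
assert (None or 'text') == 'text'

# The pattern from method 2: the left side is always False or None,
# so the list always receives the right side.
result = [(i % 2 == 0 and noisy(i)) or str(i) for i in range(4)]
```

After this runs, `result` holds the processed value for every index, while `noisy` was only called for the even ones.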

How 3 works

This works like 2: because `report(i)` has no return statement, it evaluates to `None`, and `or` adds `process(token)` to the list.

ted
  • 13,596
  • 9
  • 65
  • 107
  • 2
    instead of using a `global i`, I would go with `enumerate` and pass the `index` to the `function` – Ma0 Jun 08 '18 at 08:14
  • 1
    @Ev.Kounis and factor out the reporting part using a callback, too (code edited to fix both points). – bruno desthuilliers Jun 08 '18 at 08:35
  • This is both slower and less readable than a `for` loop. In my (admittedly very limited) tests, Alex's solution takes 10 seconds, a `for` loop takes 13, and this one takes 17. – Aran-Fey Jun 08 '18 at 08:37
  • @Aran-Fey The functionality is different though; one cannot compare them directly. – Ma0 Jun 08 '18 at 08:42
  • @Ev.Kounis Huh? What's different? The result is the same, as far as I can tell... – Aran-Fey Jun 08 '18 at 08:43
  • You can turn the reporting on and off and there is an `if` statement for the `print`s. Alex's answer just `print`s everything. Not sure how your `for` loop looks like. – Ma0 Jun 08 '18 at 08:45
  • This is the standard method. But you really, really should use `if index % 1000 == 0 and index > 0:` in the test - it is much cleaner. :) – Björn Lindqvist Mar 15 '19 at 01:23
  • @ted, why do you pass `tok` to `report`? The function does nothing with this argument, so imo it can be omitted. – Qaswed Aug 19 '19 at 12:27
  • This is true, it's more of a display for how it could be used – ted Aug 20 '19 at 21:02
4
doc_collection = [[1, 2],
                  [3, 4],
                  [5, 6]]

result = [print(progress) or
          [str(token) for token in document]
          for progress, document in enumerate(doc_collection)]

print(result)  # [['1', '2'], ['3', '4'], ['5', '6']]

I don't consider this good or readable code, but the idea is fun.

It works because print always returns None so print(progress) or x will always be x (by the definition of or).
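
For comparison, here is a sketch of the plain `for` loop that the comments recommend, using the same sample data. It should build the same list, with the progress report written as an ordinary statement instead of being folded into an expression.

```python
doc_collection = [[1, 2],
                  [3, 4],
                  [5, 6]]

result = []
for progress, document in enumerate(doc_collection):
    print(progress)  # progress report, once per document
    result.append([str(token) for token in document])

print(result)  # [['1', '2'], ['3', '4'], ['5', '6']]
```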

Alex Hall
  • 34,833
  • 5
  • 57
  • 89
  • 2
    This should NOT be the accepted answer - as far as I'm concerned, such a code will not pass a code review. Ted's solution is the correct way to solve the problem. – bruno desthuilliers Jun 08 '18 at 08:33
4

Just do:

from time import sleep
from tqdm import tqdm

def foo(i):
    sleep(0.01)
    return i

[foo(i) for i in tqdm(range(1000))]

For Jupyter notebook:

from tqdm.notebook import tqdm
noyk
  • 71
  • 5
2
def show_progress(it, milestones=1):
    for i, x in enumerate(it):
        yield x
        processed = i + 1
        if processed % milestones == 0:
            print('Processed %s elements' % processed)

Simply apply this function to anything you're iterating over. It doesn't matter if you use a loop or list comprehension and it's easy to use anywhere with almost no code changes. For example:

doc_collection = [[1, 2],
                  [3, 4],
                  [5, 6]]

result = [[str(token) for token in document]
          for document in show_progress(doc_collection)]

print(result)  # [['1', '2'], ['3', '4'], ['5', '6']]

If you only wanted to show progress for every 100 documents, write:

show_progress(doc_collection, 100) 
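
The same generator can also wrap the iterable of an ordinary `for` loop. A small sketch, with the function repeated so the snippet runs on its own and a made-up four-document collection reported every 2 documents:

```python
def show_progress(it, milestones=1):
    # Same generator as above, repeated so this snippet is self-contained.
    for i, x in enumerate(it):
        yield x
        processed = i + 1
        if processed % milestones == 0:
            print('Processed %s elements' % processed)

# Made-up sample data for illustration
doc_collection = [[1, 2], [3, 4], [5, 6], [7, 8]]

result = []
for document in show_progress(doc_collection, 2):
    result.append([str(token) for token in document])
# Prints "Processed 2 elements" and "Processed 4 elements"
```

Because the message is printed after the `yield`, it appears only once the consumer has finished with an element and asks for the next one.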
Alex Hall
  • 34,833
  • 5
  • 57
  • 89
2

Here is my implementation.

pip install progressbar2

from progressbar import progressbar
new_list = [your_function(list_item) for list_item in progressbar(old_list)]

You will see a progress bar while running the code block above.

James Chang
  • 608
  • 8
  • 21
0

I wanted to make @ted's answer more readable (imo) and add some explanation.

Tidied up solution:

# Function to print the index if it is evenly divisible by 1000:
def report(index):
    if index % 1000 == 0:
        print(index)

# The function the user wants to apply to the list elements
def process(x, index, report):
    report(index)  # Call the reporting function
    return 'something ' + x  # Just an example; replace with your own processing

# !Just an example, replace with your list to iterate over
mylist = ['number ' + str(k) for k in range(5000)]

# Running a list comprehension
[process(x, index, report) for index, x in enumerate(mylist)]

Explanation of enumerate(mylist): the enumerate function yields an index alongside each element of an iterable (cf. this question and its answers). For example:

[(index, x) for index, x in enumerate(["a", "b", "c"])] #returns
[(0, 'a'), (1, 'b'), (2, 'c')]

Note: index and x are not reserved names, just names I found convenient. [(foo, bar) for foo, bar in enumerate(["a", "b", "c"])] yields the same result.
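
enumerate also accepts an optional `start` argument, which can be handy when the reported number should be a running count of items rather than a zero-based index. A small sketch (the name `count` is just illustrative):

```python
mylist = ['a', 'b', 'c']

# start=1 makes the first number 1 instead of 0
pairs = [(count, x) for count, x in enumerate(mylist, start=1)]
print(pairs)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```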

Qaswed
  • 3,649
  • 7
  • 27
  • 47