I need to process a very large CSV file.

During the process the first line needs some special attention.

The obvious approach is to check for the value in each CSV line, but that means a string comparison for every line (around 200,000).

Another option would be to set a boolean and let the boolean compare come first in an 'or' expression.

Both options are below:

import csv

def do_extra_processing():
    pass


def do_normal_processing():
    pass


if __name__ == "__main__":

    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        line_checked: bool = False

        for line in lines:
            # Check the first line: Option 1
            if line[1] == "SomeValue":
                # Every line of the 200000 lines does the string-compare
                do_extra_processing()
            do_normal_processing()

            # Check the first line: Option 2
            if (line_checked) or (line[1] == "SomeValue"):
                # Every line of the 200000 lines does the boolean-compare first and does not evaluate the string compare
                do_extra_processing()
                line_checked = True
            do_normal_processing()

I've checked that in an 'or' expression, the second part is not evaluated when the first part is True.
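A minimal sketch of that short-circuit behaviour. The `compare` helper is hypothetical and only stands in for the string comparison so that calls to it can be counted:

```python
def compare(value):
    """Stand-in for the string comparison; counts how often it runs."""
    compare.calls += 1
    return value == "SomeValue"

compare.calls = 0

# First operand is True: `or` returns it and never calls compare().
line_checked = True
result_true_first = line_checked or compare("OtherValue")

# First operand is False: `or` has to evaluate the second operand.
line_checked = False
result_false_first = line_checked or compare("OtherValue")

print(compare.calls, result_true_first, result_false_first)
```

Note that the comparison is only skipped while the flag is `True`. As soon as the first operand is `False`, the second operand is evaluated on every pass.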

The boolean is initialized just above the for-loop and set in the if-statement when the extra_processing is done.

The question is: Is the second option with the bool-compare significantly faster?

(No need to convert to , so different question than 37615264 )

BertC
  • Hello! Can you just explain how and where you will set the boolean? Boolean checks are always faster than string comparisons – DueSouth Dec 22 '21 at 08:28
  • Thanks @Charley for the quick response. I've edited the question with explanation where the boolean is set. – BertC Dec 22 '21 at 08:33
  • Your boolean will stay True, as you never reset it to False, so the extra processing is done only once. That doesn't seem to be what you want? So the second option would not work, and it doesn't seem possible to implement it and suppress the string compare. – Malo Dec 22 '21 at 08:44
  • The two options are not equivalent. The first one will extra-process every line matching the string comparison condition while the second one seems to only process the first line of the file. What exactly do you want to achieve? – Louis Lac Dec 22 '21 at 09:09
  • You say *"During the process the first line needs some special attention"*. Should you use `not line_checked and ...` instead of `not line_checked or ...` then? I thought the point of the boolean was to avoid the string comparison and the extra processing for all lines but the first, but that's not what your code with `or` does. – Kelly Bundy Dec 22 '21 at 09:56
  • @KellyBundy, I've added brackets in the expression. Now the expression is clearer. It basically says: If the line has not been checked (boolean comparison), then do the string expression. – BertC Dec 22 '21 at 10:59
  • @BertC It was already clear what it does. What's not clear is why you're doing that. I mean, the effect of your bool is that in the first line you check the bool instead of comparing two strings, which saves a little, but that's at the expense of an **additional** bool check for **every** other line (you still do the string comparison for all those lines). That seems like an obviously bad idea and it's not clear why you'd even *consider* that. – Kelly Bundy Dec 22 '21 at 12:16
  • @KellyBundy, you write "you still do the string comparison for all those lines". Well, I don't think that is the case because I found out that in an 'or' situation with the first part evaluating to True, the second part is not evaluated. Just like in Javascript. That's why I can prevent the string comparison by putting a boolean comparison before it. – BertC Dec 22 '21 at 12:19
  • @BertC But your `line_checked` has become `True`, and thus your `not line_checked` is `False`. So then the `or` **does** evaluate the second part. – Kelly Bundy Dec 22 '21 at 12:21
  • @KellyBundy, you are absolutely right. This is a stupid mistake on my part and I'm sorry for wasting your time on this. I corrected the code by removing the 'not'. I'll pick up your suggestions below and continue from there. Many thanks. – BertC Dec 22 '21 at 12:37
  • @BertC I don't think it's corrected. I rather suspect you made it *worse*. Now the remaining lines don't do the string comparison anymore, but they now all do the `do_extra_processing`. Is that what you want? – Kelly Bundy Dec 22 '21 at 12:47

3 Answers

(Edit/note: This applies to what I think the OP's code is intended to do, not what it actually does. I've asked whether it's a bug like I suspect.)

What the original version does:

  1. Load line.
  2. Load 1.
  3. Load line[1].
  4. Load a string constant.
  5. Do a string comparison, resulting in a bool.
  6. Check the truth of a bool.

What the bool-optimized version does:

  1. Load line_checked.
  2. Check the truth of a bool.

Which is faster? Take a guess :-). Better still, measure: you might find that neither matters, i.e., that both are much faster than the actual processing done per line.
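To see the two step lists for yourself, here is a small sketch that asks the `dis` module for the instructions CPython generates for each check. The `instruction_names` helper is just for illustration, and the exact opcode names and counts differ between CPython versions, so no expected output is shown:

```python
import dis

def instruction_names(expression):
    """Return the opcode names CPython generates for an expression."""
    code = compile(expression, "<sketch>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]

# The check from the original version: subscript, constant load, comparison.
string_check = instruction_names('line[1] == "SomeValue"')

# The check from the bool version: a single name load.
bool_check = instruction_names('line_checked')

print(len(string_check), string_check)
print(len(bool_check), bool_check)
```

Fewer instructions is only a rough hint, not a timing. It does not replace measuring.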

Anyway, here are two ideas that need no extra work for the lines after the first:

  1. Separate code:
    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        for line in lines:
            if line[1] == "SomeValue":
                do_extra_processing()
            do_normal_processing()
            break

        for line in lines:
            do_normal_processing()
  2. Switch the processing function after the first line:
    with open('file.csv', newline='') as csvfile:
        lines = csv.reader(csvfile, delimiter=';')

        def process():
            if line[1] == "SomeValue":
                do_extra_processing()
            do_normal_processing()
            nonlocal process
            process = do_normal_processing
            
        for line in lines:
            process()

Not tested. The latter solution might need global instead of nonlocal if you keep that code block in the global space. Might be a good idea to put it in a function, though.
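For reference, a runnable sketch of both ideas, each wrapped in a function so that `nonlocal` works. The function names `process_separately` and `process_with_switch`, the `calls` list and the sample data are made up for illustration, and an in-memory `io.StringIO` stands in for the real file:

```python
import csv
import io

calls = []  # records which processing steps ran, for illustration only

def do_extra_processing():
    calls.append("extra")

def do_normal_processing():
    calls.append("normal")

def process_separately(lines):
    """Idea 1: handle the first line on its own, then loop over the rest."""
    first = next(lines, None)  # None if there are no lines at all
    if first is None:
        return
    if first[1] == "SomeValue":
        do_extra_processing()
    do_normal_processing()
    for line in lines:  # the reader continues after the first line
        do_normal_processing()

def process_with_switch(lines):
    """Idea 2: swap the per-line function after the first line."""
    def process_first(line):
        nonlocal process
        if line[1] == "SomeValue":
            do_extra_processing()
        do_normal_processing()
        process = process_rest  # later lines skip the check entirely

    def process_rest(line):
        do_normal_processing()

    process = process_first
    for line in lines:
        process(line)

# Made-up sample: the third line also matches but must not get extra processing.
sample = "a;SomeValue;x\nb;OtherValue;y\nc;SomeValue;z\n"

for variant in (process_separately, process_with_switch):
    calls.clear()
    variant(csv.reader(io.StringIO(sample), delimiter=';'))
    print(variant.__name__, calls)
```

Both variants do the string comparison exactly once, and neither adds a flag check to the remaining lines.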

A little benchmark: If you have a bug as I suspect, and the bool is intended to avoid the string comparison and extra processing for all but the first line, then I get times like these:

11.5 ms  11.6 ms  11.6 ms  if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines
45.1 ms  45.3 ms  45.3 ms  if line[1] == "Somevalue": doesnt_happen_in_other_lines

Code (Try it online!):

from timeit import repeat

setup = '''
is_first_line = False
line = [None, "Othervalue"]
'''

statements = [
    'if is_first_line and line[1] == "Somevalue": doesnt_happen_in_other_lines',
    'if line[1] == "Somevalue": doesnt_happen_in_other_lines',
]

for _ in range(3):
    for stmt in statements:
        ts = sorted(repeat(stmt, setup))[:3]
        print(*('%4.1f ms ' % (t * 1e3) for t in ts), stmt)
    print()
Kelly Bundy

Before further tests I would have advised using the second version, because we all know that testing a boolean is simpler than testing string equality.

Then I did what I advised @AidenEllis to do (Python 3.10 on Windows), and was kind of amazed:

timeit('x = "foo" if a == b else "bar"', '''a=True
b=False
''')
0.031938999999511
timeit('x = "foo" if a == b else "bar"', '''a=True
b=True
''')
0.032499900000402704
timeit('x = "foo" if a == b else "bar"', '''a="Somevalue"
b="Somevalue1"
''')
0.03237569999964762

Nothing really significant...

Then I tried:

timeit('x = "foo" if a else "bar"', 'a=True')
0.022047000000384287
timeit('x = "foo" if a else "bar"', 'a=False')
0.020898400000078254

Close to 30% faster, looks good...

And finally:

timeit('x = "foo" if (a or (b == c)) else "bar"', '''a=True
b="Somevalue"
c="Somevalue1"
''')
0.022851300000183983

Still significant but it means that testing a boolean is faster than comparing 2 values whatever the type of the values, even if they are boolean. Not really what I expected...

My conclusion is that we are playing with implementation details (which is why I gave the Python version), and that the only sensible answer is that it does not really matter: the gain, if any, should be negligible compared to the real processing time.

Serge Ballesta
  • Hmm, actually I suspect they have a bug, should use `and` instead of `or`. But it's not entirely clear what they want to do, as also pointed out by Louis's question comment. Also, they use an `if` statement, which only does something in the true case, nothing in the false case. Whereas your conditional expressions do something in both cases. – Kelly Bundy Dec 22 '21 at 09:47
  • @KellyBundy: You are right, my last test was rather stupid... – Serge Ballesta Dec 22 '21 at 09:54
  • I've added another benchmark at the end of my answer, like I think their code is intended to work, and also including loading `line[1]`. The bool-optimized version takes only a quarter of the time. – Kelly Bundy Dec 22 '21 at 10:07
  • I wouldn't call your original last test "stupid", btw. It was rather equivalent to the OP's code, after all. You had `False` instead of their `True`, but you're also omitting their `not`, so that canceled out. Your updated code differs from the OP's. And as your original version and its timing showed, the OP's "optimization" makes it *slower*. It avoids a single string comparison, for the first line, at the expense of an extra bool check for every other line. That's rather obviously a bad idea, and it's one reason I highly doubt it is what they intended. – Kelly Bundy Dec 22 '21 at 10:30
  • One more thing: I'd say it's rather expected that testing a bool is faster than comparing two bools. Because the comparison results in a bool, and then that bool gets tested. So the comparison does strictly more work. I guess Python *could* check `==` by first checking the types and optimizing for certain types, but that would be extra cost for *every* comparison. And quite possibly might have an overall negative effect. (Though I wouldn't be totally surprised if Python actually did it, wouldn't be the first [optimization not visible in bytecode](https://stackoverflow.com/questions/69079181)). – Kelly Bundy Dec 22 '21 at 10:46

This might not be the quick answer you're looking for, but why don't you just compare the processing time of both? Run each version individually and check which one finishes faster.

Use this if you just want to quickly compare two different pieces of code:

import time

start = time.perf_counter()

# do your processes here

finish = time.perf_counter()
total = finish - start
print(f"Process Time: {round(total * 1000, 2)}ms")

aight, back to it XD
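
As a rough sketch of how that comparison could look for this question: the code below times a check on every line against a check guarded by a flag, over made-up in-memory rows instead of a real CSV file. The helper names (`make_rows`, `check_every_line`, `check_first_line_only`, `best_time`) are hypothetical, and the flag version uses `and`, which is an assumption about what the question intends:

```python
import time

def make_rows(count):
    """Build in-memory rows shaped like parsed CSV lines (made-up data)."""
    rows = [["id", "SomeValue"]]
    rows.extend([str(i), "OtherValue"] for i in range(count - 1))
    return rows

def check_every_line(rows):
    """String comparison on every line."""
    hits = 0
    for line in rows:
        if line[1] == "SomeValue":
            hits += 1
    return hits

def check_first_line_only(rows):
    """Flag tested first with `and`, so later lines skip the comparison."""
    hits = 0
    is_first_line = True
    for line in rows:
        if is_first_line and line[1] == "SomeValue":
            hits += 1
        is_first_line = False
    return hits

def best_time(func, rows, repeats=5):
    """Smallest wall-clock time over a few runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(rows)
        timings.append(time.perf_counter() - start)
    return min(timings)

rows = make_rows(200_000)
for func in (check_every_line, check_first_line_only):
    print(f"{func.__name__}: {best_time(func, rows) * 1000:.2f} ms")
```

Taking the minimum of a few runs reduces noise from other processes. The numbers depend on the machine and Python version, and resetting the flag on every pass has a cost of its own.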

  • In Python, you should not advise using raw times to benchmark processing, because the standard library contains the `timeit` module, which nicely repeats the operation a number of times (with a possible initialization phase) to limit the effect of external processes. Not downvoting because the benchmark advice is good, but using timeit would be better... – Serge Ballesta Dec 22 '21 at 08:57
  • @SergeBallesta OP said it's a "very large CSV file". I don't think `timeit` is very advantageous then, at least not for the reason you mention. – Kelly Bundy Dec 22 '21 at 09:09
  • Yeah, I also think timeit won't be necessary here. But I just found out `timeit.default_timer()` is more accurate. But you can keep it simple here. – Aiden Ellis Dec 22 '21 at 09:10
  • That answer edit just now actually made me laugh. Looking forward to Serge's reaction, it's like you're trolling them :-) – Kelly Bundy Dec 22 '21 at 09:10
  • @AidenEllis More accurate? Than `time.perf_counter()`? Are you sure you've read its [documentation](https://docs.python.org/3/library/timeit.html#timeit.default_timer)? :-P – Kelly Bundy Dec 22 '21 at 09:14
  • @KellyBundy let me know if you want me to delete this Answer cuz it makes you laugh so hard that my dead grandma starts laughing at me too. – Aiden Ellis Dec 22 '21 at 09:14
  • No, don't delete. It's not like I'm laughing at you. I'm laughing at your joke, taking Serge literally by using `timeit` but actually still doing the exact same thing. – Kelly Bundy Dec 22 '21 at 09:16
  • And like I said, I think it's ok the way it is. The `timeit` module has two kinds of repetitions, one is for when execution time is tiny (then it does it like a million times), which isn't the case here. The other might still be good here, but I think it's comparable to doing it your way a few times with a loop. Or running the whole script multiple times. Running the whole script multiple times could also be done with for example the [time command](https://en.wikipedia.org/wiki/Time_(Unix)), which could offer more/better information. – Kelly Bundy Dec 22 '21 at 09:28