
I was trying to process a file in Python. Long story short, here are two versions of the code I wrote:

for line in file:
    if line[0:2] == ".I":
        pass  # do something
    elif line[0:2] == ".T":
        pass  # do something else
    elif line[0:2] == ".A":
        ...

There were some 21,000 lines in the file. However, when I altered my code to just this:

for line in file:
    if line[0] == ".":
        if line[1] == "I":
            pass  # do something
        elif line[1] == "T":
            pass  # do something
        elif line[1] == "A":
            ...

the runtime changed dramatically, I mean from 40 mins to 30 seconds. I know list slicing is O(N), but in this case we were only slicing the first two characters of the string. So what caused it to change this dramatically?

user1234
  • Do most of the lines in your file not start with `'.'`? – Dani Mesejo Oct 13 '21 at 19:53
  • This might help: https://stackoverflow.com/a/35181399/12229158 – whege Oct 13 '21 at 19:54
  • Are abstract and file the same length? If not, then that's the problem. If they are, then most likely the problem is that you are comparing two elements (0 and 1) in the first example and only a single element in the second example. – VRComp Oct 13 '21 at 19:54
  • @DaniMesejo Most of it does not. But could you explain why that affects the runtime? – user1234 Oct 13 '21 at 19:56
  • @user1234 Check this out instead of my original link: https://wiki.python.org/moin/TimeComplexity – whege Oct 13 '21 at 19:57
  • @VRComp Edited my post, my bad. I made a mistake – user1234 Oct 13 '21 at 19:58
  • Well, in the first approach, if a line does not start with a `"."` you do 3 comparisons (that involve 3 slices), and in the second approach you only do one comparison that involves no slicing. – Dani Mesejo Oct 13 '21 at 19:59
  • Just curious: Did you measure the time with `line.startswith(...)`? – Matthias Oct 13 '21 at 20:01
  • The first question was a good one: what percentage of lines start with a "."? That's the difference. How many of these conditions do you have, and what is the action to be taken? You may find that a dictionary with functions that implement the "do something" is the fastest. – tdelaney Oct 13 '21 at 20:08
  • BTW, `line` is not a list, it's a `str`. There is no list slicing going on here... – juanpa.arrivillaga Oct 13 '21 at 20:10
  • *I mean from 40 mins to 30 seconds* Are these measurements accurate? – Dani Mesejo Oct 13 '21 at 20:21
  • @DaniMesejo More like estimates, but I think 30 mins to 1 min is more accurate. – user1234 Oct 13 '21 at 20:28
  • Can you run `time python script.py` to make sure that the time measurements are accurate? The `time` command will give you three outputs at the end of execution called `real`, `user`, and `sys`. What is the `real` time? – VRComp Oct 13 '21 at 20:29
  • Do you have more comparisons (more elif clauses) or just these? – Dani Mesejo Oct 13 '21 at 20:33
  • One other variable is the file system cache. If you run both scenarios back to back, the file has already been read into the operating system cache on the second try. You can just `cat` the file to `/dev/null` once beforehand so both runs have the same conditions. – tdelaney Oct 13 '21 at 20:45
  • You could do `result = collections.Counter(not line.startswith(".") or line[:2] for line in open('file.txt'))` to get a feel for distribution. The `True` entry will be all lines that don't start in ".", and the remainder a count for each value. – tdelaney Oct 13 '21 at 20:55
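For reference, the `Counter` one-liner in the last comment expands to something like the sketch below (the filename is a placeholder):

import collections

# Tally prefixes: every line that doesn't start with "." counts under the
# key True; lines that do start with "." are counted by their 2-char prefix.
with open('file.txt') as file:
    result = collections.Counter(
        not line.startswith(".") or line[:2] for line in file
    )
print(result.most_common())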

2 Answers


Indexing is twice as fast as slicing, but this is a comparison of very small numbers. When run a million times, the difference is about 0.04 seconds. That's not the difference you see in your code.

>>> timeit("s[0:2]=='aa'", setup="s = '12345'")
0.08988943499571178
>>> timeit("s[0]=='a'", setup="s = '12345'")
0.05322081400663592
>>> timeit("val=='aa'", setup="val='aa'")
0.03722755100170616

You could speed up both cases a little by assigning the sliced or indexed value to a variable once and using that variable for the comparisons. You could also do this inside a function, where referencing local variables is slightly faster.
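For example, a minimal sketch of both ideas (the branch bodies are placeholders):

def process(file):
    # Local names resolve faster than globals, and the slice is taken
    # once per line instead of once per elif branch.
    for line in file:
        prefix = line[0:2]
        if prefix == ".I":
            pass  # do something
        elif prefix == ".T":
            pass  # do something else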

Now to the bigger problem. Let's say you have 10,000 lines, and 1,000 of them start with ".", with those lines evenly distributed between ".A" and ".Z". You will check 23 different values on average. In the first case, that's 10000 * 23 or 230,000 total checks. In the second case, you eliminate most candidates with a single check, then run the remaining lines through the average 23 checks. That's (9000) + (1000 * 23) or 32,000 total checks. An 86% reduction in conditions checked.
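As a quick back-of-the-envelope check of those counts (illustrative numbers only):

lines, dotted = 10000, 1000
avg_checks = 23  # rough average depth of the elif chain
flat = lines * avg_checks                        # every line walks the chain
nested = (lines - dotted) + dotted * avg_checks  # one check rejects most lines
print(flat, nested)       # 230000 32000
print(1 - nested / flat)  # ~0.86, i.e. an 86% reduction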

Let's go further. Suppose you have ".whatever" values that you aren't interested in. Each one of those has to go through all 26 checks before you realize it's a dud. If that's the case, you can group all of your comparators into a set and check membership first.

wanted = {".A", ".B", etc...)
for line in file:
    check = line[:2]
    if check in wanted:
        val = check[1]
        if ...
        

You can go even further if you can write your "do_something" code as functions.

def do_thing_A():
    pass

def do_thing_B():
    pass

def do_nothing():
    pass

# Map each two-character prefix to its handler.
do_all_the_things = {".A": do_thing_A, ".B": do_thing_B}

for line in file:
    # Unknown prefixes fall through to do_nothing.
    do_all_the_things.get(line[:2], do_nothing)()
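If the handlers need data from the line itself, a natural variant (an assumption here, not shown above) is to give each function a `line` parameter and call `do_all_the_things.get(line[:2], do_nothing)(line)`.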
tdelaney

I'm still looking into the specifics of what is going on under the hood, but according to the Python Wiki, indexing has constant time complexity (O(1)), whereas slicing has a complexity that depends on the size of the slice, O(k).
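A quick way to see that scaling (a sketch using `timeit`; absolute numbers vary by machine):

from timeit import timeit

# Indexing stays flat as the string grows; slicing cost tracks the
# slice length k, not the length of the string.
print(timeit("s[0]", setup="s = 'x' * 1000"))      # constant
print(timeit("s[0:2]", setup="s = 'x' * 1000"))    # small k
print(timeit("s[0:500]", setup="s = 'x' * 1000"))  # larger k, noticeably slower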

whege
  • right, but the slices here are all small, of size 2. – juanpa.arrivillaga Oct 13 '21 at 20:08
  • True; my guess would be it's a combination of time complexity as well as the number of lines which pass OP's first condition. – whege Oct 13 '21 at 20:11
  • I don't think time complexity is relevant here. Technically, the time complexity is *constant time* since the slice is always the same size. On my machine, indexing into a string takes about 33 nanoseconds, and slicing a length-2 string takes about 90. So that's about 3 times slower, and it isn't scaling. It doesn't really explain the 80x difference. – juanpa.arrivillaga Oct 13 '21 at 20:13