
I'm writing a program where I need to iterate through all the lines of a file. Pretty standard stuff. My concern is that, because I want to strip the newline character from each line, I'm doing something like this:

for line in map(lambda s: s.strip("\n"), file_input.readlines()):
    do_something(line)

Does Python re-evaluate map(lambda s: s.strip("\n"), file_input.readlines()) on every iteration of the loop? My gut feeling is that Python is smarter than this, but any confirmation and/or reference would be very helpful to me!


I'm assuming the answer also applies to list comprehensions, for something like:

results = [x for x in list(set(self.database.values())) if x.startswith(text)]

where I hope it is not calculating list(set(self.database.values())) multiple times.

drinkmorewater
  • I think using `file_input.readline()` instead of `readlines()` might be a better solution here, but just for the sake of discussion, will the above loop condition be evaluated multiple times? – drinkmorewater Mar 02 '20 at 21:54
  • No, because `map()` returns a [generator function](https://docs.python.org/3/howto/functional.html#generators) so its arguments are only evaluated once. Note that the `readlines()` call _is_ inefficient (or at least unnecessary). – martineau Mar 02 '20 at 21:55
  • It's not clear why you would think the `list(set(self.database.values()))` in the list comp would be executed more than once. – martineau Mar 02 '20 at 22:00
  • `map` returns an instance of `map`, which is an iterator, not a generator function. – chepner Mar 02 '20 at 22:03
  • @martineau Correct me if I'm wrong, but I'm using it to remove duplicates from `database.values()`. As per your question I didn't think this part was the list comp. I was really just asking for any list comprehension, `[x for x in expression_a if expression_b]`, will `expression_a` be executed more than once. – drinkmorewater Mar 02 '20 at 23:22
  • @martineau and thanks for pointing out about the `readlines()`. I'll keep that in mind. – drinkmorewater Mar 02 '20 at 23:23
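
A quick sketch of the distinction drawn in the comments above, assuming Python 3 (in Python 2, map returns a list instead). The sample list stands in for the file and is made up for illustration:

```python
import inspect
from collections.abc import Iterator

# Made-up sample data standing in for the lines of a file.
lines = ["a\n", "b\n"]

# The map(...) expression is evaluated once, here, and yields a map object.
stripped = map(lambda s: s.strip("\n"), lines)

print(type(stripped).__name__)         # map
print(isinstance(stripped, Iterator))  # True: it is an iterator
print(inspect.isgenerator(stripped))   # False: it is not a generator

# An iterator is consumed as it is read, so a second pass finds nothing left.
first_pass = list(stripped)
second_pass = list(stripped)
print(first_pass)   # ['a', 'b']
print(second_pass)  # []
```

The empty second pass is a side effect worth knowing about: the for loop asks the one map object for the next item each time round rather than rebuilding it.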

2 Answers


First, there's no need to call readlines; reading the entire file into memory is the most inefficient part of this command.

However, the improved version

for line in map(lambda s: s.strip("\n"), file_input):
    do_something(line)

isn't significantly less efficient than

for line in file_input:
    line = line.strip("\n")
    do_something(line)

It's just a more functional style, though there is an extra function call per line: map calls the lambda, which in turn calls s.strip("\n") for us. We can avoid that with a generator expression:

for line in (x.strip("\n") for x in file_input):
    do_something(line)

or we can use the methodcaller class from the operator module:

from operator import methodcaller

for line in map(methodcaller("strip", "\n"), file_input):
    do_something(line)

which may be even a little more efficient, as we push more work down into the implementation of Python itself, rather than writing pure Python code to do the same thing.
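
Whether any of these is measurably faster is easy to check rather than guess. The following is only a rough sketch, assuming Python 3; the in-memory DATA string, the placeholder do_something, and the four wrapper function names are all made up for illustration, and the numbers will differ between machines and Python versions:

```python
import io
import timeit
from operator import methodcaller

# Hypothetical stand-in for the question's file: 10,000 short lines in memory.
DATA = "some text\n" * 10_000


def do_something(line):
    # Placeholder for the question's do_something(); just returns the length.
    return len(line)


def with_lambda(f):
    total = 0
    for line in map(lambda s: s.strip("\n"), f):
        total += do_something(line)
    return total


def plain_loop(f):
    total = 0
    for line in f:
        line = line.strip("\n")
        total += do_something(line)
    return total


def with_genexpr(f):
    total = 0
    for line in (x.strip("\n") for x in f):
        total += do_something(line)
    return total


def with_methodcaller(f):
    total = 0
    for line in map(methodcaller("strip", "\n"), f):
        total += do_something(line)
    return total


VARIANTS = [with_lambda, plain_loop, with_genexpr, with_methodcaller]

for variant in VARIANTS:
    # A fresh StringIO per run, because a file-like object is consumed once.
    seconds = timeit.timeit(lambda: variant(io.StringIO(DATA)), number=20)
    print(f"{variant.__name__:18s} {seconds:.3f} s")
```

All four variants compute the same total, so any difference in the timings comes from the looping style alone.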

chepner
  • Great! I wasn't aware of this usage of files. Please see my update of the second part of the question, I think the expression after the `for` is not evaluated multiple times either, right? – drinkmorewater Mar 02 '20 at 22:03
  • It's executed exactly once, then (implicitly) passed to `iter` to get the iterator that the `for` clause actually uses. – chepner Mar 02 '20 at 22:06
  • Thanks! That's exactly what I was looking for. – drinkmorewater Mar 02 '20 at 23:12

Let's promote the Levels of Efficiency :

The most efficient option (if nothing forbids it) would be to remove the "\n" instances beforehand, using smart and efficient O/S-tools, and only then process the rest of the file-I/O. Note that in universal-newlines text mode Python hands every line from an aFileINPUT-iterator back terminated by "\n", as noted in the documentation, irrespective of which line separator ("\n", "\r\n" or "\r"; compare os.linesep) was actually used in the input stream.
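
The newline translation mentioned above can be observed directly. A minimal sketch, assuming Python 3 and its default text mode (newline=None); the sample bytes and the temporary file are made up for illustration:

```python
import os
import tempfile

# Three lines, each ending with a different separator.
raw = b"first\r\nsecond\rthird\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default text mode: "\r\n" and "\r" are both translated to "\n" on read.
    with open(path, "r") as text_file:
        lines = list(text_file)
finally:
    os.remove(path)

print(lines)  # ['first\n', 'second\n', 'third\n']
```

This is why stripping "\n" alone is sufficient when iterating over a file opened this way.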


Let's measure the Levels of Efficiency by decoding the actual flow of operations (the listings below are CPython 2 bytecode, obtained after an import dis):

On using map( lambda ) :

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL]
#                                           - local-GIL-lock
#                                           - local-CPU
#                                           - local-RAM-I/O :

>>> def a_map_lambda_loop( aFileINPUT ):
...     for line in map( lambda s: s.strip( "\n" ), aFileINPUT ):
...         do_something( line )

>>> dis.dis( a_map_lambda_loop )
  2           0 SETUP_LOOP              36 (to 39)
              3 LOAD_GLOBAL              0 (map)
              6 LOAD_CONST               1 (<code object <lambda> at 0x7ff8fee7b930, file "<stdin>", line 2>)
              9 MAKE_FUNCTION            0
             12 LOAD_FAST                0 (aFileINPUT)
             15 CALL_FUNCTION            2
             18 GET_ITER            
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_FAST               1 (line)

  3          25 LOAD_GLOBAL              1 (do_something)
             28 LOAD_FAST                1 (line)
             31 CALL_FUNCTION            1
             34 POP_TOP             
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK           
        >>   39 LOAD_CONST               0 (None)
             42 RETURN_VALUE        

On using the loop promoted by @chepner :

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL]
#                                           - local-GIL-lock
#                                           - local-CPU
#                                           - local-RAM-I/O :

>>> def a_loop_runner( aFileINPUT ):
...     for line in aFileINPUT:
...         line = line.strip( "\n" )
...         do_something( line )

>>> dis.dis( a_loop_runner )
  2           0 SETUP_LOOP              39 (to 42)
              3 LOAD_FAST                0 (aFileINPUT)
              6 GET_ITER            
        >>    7 FOR_ITER                31 (to 41)
             10 STORE_FAST               1 (line)

  3          13 LOAD_FAST                1 (line)
             16 LOAD_ATTR                0 (strip)
             19 LOAD_CONST               1 ('\n')
             22 CALL_FUNCTION            1
             25 STORE_FAST               1 (line)

  4          28 LOAD_GLOBAL              1 (do_something)
             31 LOAD_FAST                1 (line)
             34 CALL_FUNCTION            1
             37 POP_TOP             
             38 JUMP_ABSOLUTE            7
        >>   41 POP_BLOCK           
        >>   42 LOAD_CONST               0 (None)
             45 RETURN_VALUE        

On using methodcaller() :

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL]
#                                           - local-GIL-lock
#                                           - local-CPU
#                                           - local-RAM-I/O :

>>> def a_methodcaller_loop( aFileINPUT ):
...     for line in map( methodcaller( "strip", "\n" ), aFileINPUT ):
...         do_something( line )

>>> dis.dis( a_methodcaller_loop )
  2           0 SETUP_LOOP              42 (to 45)
              3 LOAD_GLOBAL              0 (map)
              6 LOAD_GLOBAL              1 (methodcaller)
              9 LOAD_CONST               1 ('strip')
             12 LOAD_CONST               2 ('\n')
             15 CALL_FUNCTION            2
             18 LOAD_FAST                0 (aFileINPUT)
             21 CALL_FUNCTION            2
             24 GET_ITER            
        >>   25 FOR_ITER                16 (to 44)
             28 STORE_FAST               1 (line)

  3          31 LOAD_GLOBAL              2 (do_something)
             34 LOAD_FAST                1 (line)
             37 CALL_FUNCTION            1
             40 POP_TOP             
             41 JUMP_ABSOLUTE           25
        >>   44 POP_BLOCK           
        >>   45 LOAD_CONST               0 (None)
             48 RETURN_VALUE        

On using an ALAP (as-late-as-possible) .strip() call, for the case where the .strip() cannot be deferred into do_something() itself, and possibly distributed there for an even higher processing efficiency - { pure-[SERIAL] | just-[CONCURRENT] }, { local | independent }-GIL-lock(s), { local | distributed }-CPU, { local | distributed }-RAM-I/O:

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL] |+ just-[CONCURRENT]
#                                           - local-GIL-lock|+ independent-GIL-lock
#                                           - local-CPU     |+ independent-CPUs
#                                           - local-RAM-I/O |+ independent-RAM-I/O

>>> def ALAP_runner( aFileINPUT ):
...     for line in aFileINPUT:
...         do_something( line.strip( "\n" ) )

>>> dis.dis( ALAP_runner )
  2           0 SETUP_LOOP              33 (to 36)
              3 LOAD_FAST                0 (aFileINPUT)
              6 GET_ITER            
        >>    7 FOR_ITER                25 (to 35)
             10 STORE_FAST               1 (line)

  3          13 LOAD_GLOBAL              0 (do_something)
             16 LOAD_FAST                1 (line)
             19 LOAD_ATTR                1 (strip)
             22 LOAD_CONST               1 ('\n')
             25 CALL_FUNCTION            1
             28 CALL_FUNCTION            1
             31 POP_TOP             
             32 JUMP_ABSOLUTE            7
        >>   35 POP_BLOCK           
        >>   36 LOAD_CONST               0 (None)
             39 RETURN_VALUE        

More details depend heavily on the nature of do_something() and on the actual costs under the overhead-strict re-formulation of Amdahl's Law: account for all the add-on overhead costs and add to them the process-communication costs ( pickle{ .dumps() | .loads() }-based SER/DES costs and IPC-{ channel | network }-communication latencies ) when going from a pure-[SERIAL] to a just-[CONCURRENT] execution, all the more so if it is { process | node }-distributed.
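
To get a feeling for the SER/DES part of those add-on costs alone, one can time a pickle round trip next to the bare .strip() call. This is only a sketch, assuming Python 3; the sample line and both function names are made up for illustration, and IPC latencies are not modelled at all:

```python
import pickle
import timeit

# Made-up sample line standing in for one line of the input file.
line = "some text\n"


def strip_only():
    # The work itself, done locally.
    return line.strip("\n")


def strip_after_round_trip():
    # The same work after the SER/DES step a worker process would require.
    return pickle.loads(pickle.dumps(line)).strip("\n")


for candidate in (strip_only, strip_after_round_trip):
    seconds = timeit.timeit(candidate, number=100_000)
    print(f"{candidate.__name__:24s} {seconds:.3f} s")
```

If the round trip alone dwarfs the useful work per line, distributing the .strip() step is unlikely to pay off unless do_something() is expensive.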


On a list comprehension with an if-based filter - pure-[SERIAL], local-GIL-lock, local-CPU, local-RAM-I/O ( and unprotected against a MemoryError crash if the list being built on the fly outgrows the available memory ):

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL]
#                                           - local-GIL-lock
#                                           - local-CPU
#                                           - local-RAM-I/O :

>>> def anOnTheFlyGrowingListComprehension( self ):
...     results = [x for x in list(set(self.database.values())) if x.startswith(text)]

>>> dis.dis( anOnTheFlyGrowingListComprehension )
  2           0 BUILD_LIST               0
              3 LOAD_GLOBAL              0 (list)
              6 LOAD_GLOBAL              1 (set)
              9 LOAD_FAST                0 (self)
             12 LOAD_ATTR                2 (database)
             15 LOAD_ATTR                3 (values)
             18 CALL_FUNCTION            0
             21 CALL_FUNCTION            1
             24 CALL_FUNCTION            1
             27 GET_ITER            
        >>   28 FOR_ITER                27 (to 58)
             31 STORE_FAST               1 (x)
             34 LOAD_FAST                1 (x)
             37 LOAD_ATTR                4 (startswith)
             40 LOAD_GLOBAL              5 (text)
             43 CALL_FUNCTION            1
             46 POP_JUMP_IF_FALSE       28
             49 LOAD_FAST                1 (x)
             52 LIST_APPEND              2
             55 JUMP_ABSOLUTE           28
        >>   58 STORE_FAST               2 (results)
             61 LOAD_CONST               0 (None)
             64 RETURN_VALUE        
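
In the listing above the three CALL_FUNCTION instructions for values(), set() and list() sit ahead of GET_ITER and FOR_ITER, and every jump goes back to FOR_ITER, so the iterable is built once, outside the loop. The same can be checked without reading bytecode. A minimal sketch, assuming Python 3; the counter dictionary and the helpers source and wanted are made up for illustration:

```python
# Counters for how often each piece of the expression gets evaluated.
calls = {"source": 0, "predicate": 0}


def source():
    # Stands in for self.database.values() / file_input.readlines().
    calls["source"] += 1
    return ["apple", "avocado", "banana"]


def wanted(x):
    # Stands in for x.startswith(text); runs once per element.
    calls["predicate"] += 1
    return x.startswith("a")


# List comprehension: source() is called once, wanted() once per element.
results = [x for x in list(set(source())) if wanted(x)]

# for loop over map(...): source() is again called just once.
for item in map(str.upper, source()):
    pass

print(sorted(results))  # ['apple', 'avocado']
print(calls)            # {'source': 2, 'predicate': 3}
```

Only the condition after if is evaluated per element; the expression after in is evaluated a single time in both forms.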

or yet another, closer view on an iterator-formulated pure-[SERIAL] "front"-end .strip()-er. Under Python 2, dis.dis() treats a plain str argument as raw bytecode rather than as source text, so the expression has to be compiled first ( from a raw string, so that the \n stays inside the string literal ):

############################################################# EFFICIENCY LIMITS :
#                                           - pure-[SERIAL]
#                                           - local-GIL-lock
#                                           - local-CPU
#                                           - local-RAM-I/O :

>>> dis.dis( compile( r'( do_something( line.strip( "\n" ) ) for line in aFileINPUT )', '<stdin>', 'eval' ) )

The listing this produces covers only the creation of the generator object; the loop itself lives in the nested <genexpr> code object, which Python 2's dis.dis() does not descend into.

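
For comparison, a sketch assuming Python 3, where the source text is compiled explicitly and nested code objects (such as the <genexpr> that holds the loop) are walked by hand. The helper name opnames_by_code_object is made up for illustration, and the exact opcode names vary between CPython versions:

```python
import dis
import types

# Raw string, so that \n stays a two-character escape inside the inner literal.
SOURCE = r'( do_something( line.strip( "\n" ) ) for line in aFileINPUT )'


def opnames_by_code_object(code):
    """Map each code object's name to its opcode names, including nested ones."""
    found = {code.co_name: [ins.opname for ins in dis.get_instructions(code)]}
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            found.update(opnames_by_code_object(const))
    return found


# Compiling only; nothing is executed, so the names need not exist.
listing = opnames_by_code_object(compile(SOURCE, "<sketch>", "eval"))

for name, opnames in listing.items():
    print(name, opnames)
```

The FOR_ITER instruction shows up under the <genexpr> entry, which is where the per-line .strip() and do_something() calls are executed.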
user3666197