I've come across this behaviour recently, and I am a little confused as to why it happens. My initial assumption is that some sort of optimisation is going on when calling a function that does not happen when running a bare statement.
Let's start with a simple example:
somestring="climate change is a big problem. However emissions are still rising"
sometopics=["climate","change","problem","big","rising"]
Assume we have a list of strings, similar to "somestring" above, and we also have a list of topics, like sometopics.
We would like to check whether any of the "sometopics" occur in "somestring" and, importantly, return those that do in a new list.
With a list comprehension statement we can do it like this for one string:
result = [element for element in sometopics if element in somestring]
On my machine, however, a function defined as below runs about 20-30% faster than the statement above.
def comparelistoftopicstokw(mystring, somelistoftopics):
    result = [element for element in somelistoftopics if element in mystring]
    return result
Why does this happen?
Is it always the case that a function will be faster than an equivalent statement or list of statements?
EDIT:
See the minimal reproducible notebook example below:
import pandas as pd, numpy as np
columns_df = pd.DataFrame({"Keyword": ['fish soup', 'katsu', 'soup']})  # Create a pandas DataFrame to write into 500k columns
somestring="pad thai is a good recipe. It is cooked with chicken or lamb or beef"
sometopics=["chicken","pad thai","recipe","lamb","beef"]
print(len(sometopics))
somebigtopics=sometopics*100000
def extractsubstrings(inputstring, alistofpossibletopics):
    # obvious, very slow for loop
    topicslist = []
    print(inputstring)
    for topic in alistofpossibletopics:
        if str(topic) in inputstring:
            topicslist.append(str(topic))
    return topicslist  # without this, the caller receives None
%%time
def listcompinlists(mystring, bigtopic):
    res = [ele for ele in bigtopic if ele in mystring]
    return res
%%time
res = [ele for ele in somebigtopics if(ele in somestring)]
%%time
x=extractsubstrings(somestring,somebigtopics)
%%time
funcres=listcompinlists(somestring,somebigtopics)
On my machine (Ubuntu 18.04, Python 3.6), the list comprehension statement runs in 22-24 ms for the case above, while the function runs in 18-21 ms. It's not a huge difference, but if you have 10 million rows to process, for example, that adds up to a fair few hours saved.
TL;DR performance comparison:
extractsubstrings: Wall time: 122 ms
list comprehension statement: Wall time: 24.5 ms
listcompinlists: Wall time: 18.6 ms
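For anyone who wants to reproduce this outside a notebook, here is a rough sketch of the same comparison using the standard-library timeit module instead of %%time. The variable names stmt_time and func_time and the number/repeat values are my own choices for illustration, and the figures it prints will differ between machines and Python versions.

```python
import timeit

somestring = "pad thai is a good recipe. It is cooked with chicken or lamb or beef"
sometopics = ["chicken", "pad thai", "recipe", "lamb", "beef"]
somebigtopics = sometopics * 100000


def listcompinlists(mystring, bigtopic):
    # Same comprehension as the bare statement, but wrapped in a function.
    return [ele for ele in bigtopic if ele in mystring]


# Bare statement: it reads the module-level names somebigtopics and somestring.
# Taking the minimum of several repeats reduces noise from other processes.
stmt_time = min(timeit.repeat(
    "res = [ele for ele in somebigtopics if ele in somestring]",
    globals=globals(), number=3, repeat=3)) / 3

# Function call: the same data is passed in as arguments.
func_time = min(timeit.repeat(
    "funcres = listcompinlists(somestring, somebigtopics)",
    globals=globals(), number=3, repeat=3)) / 3

print(f"list comprehension statement: {stmt_time * 1000:.1f} ms per run")
print(f"listcompinlists: {func_time * 1000:.1f} ms per run")
```

Both variants build the same list, so any difference in the printed times comes from how the code is executed rather than from what it computes.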