I have a chain of for-loops that works on an original list of strings, gradually filtering the list as it goes down the chain, e.g.:
import re
# Regex to check that a cap exists in the string.
pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly'] # Imagine it's a longer list.
def check_no_caps(s):
    return None if pattern1.match(s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s, x):
    # s and x are both str.
    return None if s not in vocab else s + x
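To be concrete, each helper either passes the string through (possibly transformed) or returns None to signal that it should be dropped. Here is a standalone snippet (the helpers are repeated so it runs by itself):

```python
import re

pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly']

def check_no_caps(s):
    return None if pattern1.match(s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s, x):
    return None if s not in vocab else s + x

# Strings that pass come through, possibly transformed...
assert check_no_caps('the') == 'the'
assert check_nomorethan_five('jumps') == 'jumps'
assert check_in_vocab_plus_x('dog', '1') == 'dog1'
# ...and strings that fail a check become None.
assert check_no_caps('1Ab') is None
assert check_nomorethan_five('jumped') is None
assert check_in_vocab_plus_x('jumps', '2') is None
```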
slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps
slist = [check_no_caps(s) for s in slist]
# filter no more than 5.
slist = [check_nomorethan_five(s) for s in slist if s is not None]
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]
The above is just an example; in reality my functions that manipulate the strings are more complicated, but they do return either the original string or a manipulated one.
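For reference, here is the full chain run end to end on the sample list (a standalone copy so it can be pasted and run as-is):

```python
import re

pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly']

def check_no_caps(s):
    return None if pattern1.match(s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s, x):
    return None if s not in vocab else s + x

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
slist = [check_no_caps(s) for s in slist]
slist = [check_nomorethan_five(s) for s in slist if s is not None]
slist = [check_in_vocab_plus_x(s, str(i)) for i, s in enumerate(slist) if s is not None]
print(slist)  # ['the0', 'dog1', None, None, 'the4', 'fly5']
```

Note that Nones can survive the last step, since the last function itself returns None for out-of-vocab words.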
I could use generators instead of lists and do something like this:
slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
# filter with check_no_caps and no more than 5.
slist = (check_nomorethan_five(s1)
         for s1 in (check_no_caps(s0) for s0 in slist)
         if s1 is not None)
# filter in vocab
slist = [check_in_vocab_plus_x(s, str(i)) for i,s in enumerate(slist) if s is not None]
Or in one crazy nested generator:
slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']
slist = (check_in_vocab_plus_x(s2, str(i))
         for i, s2 in enumerate(
             check_nomorethan_five(s1)
             for s1 in (check_no_caps(s0) for s0 in slist)
             if s1 is not None)
         if s2 is not None)
There must be a better way. Is there a way to make the chain of for-loops faster?
Is there a way to do it with map, reduce and filter? Will it be faster?
Imagine that my original slist is very, very large, on the order of tens of billions of items. And my functions are not as simple as the functions above; they do some real computation and can be called around 1,000 times per second.
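To clarify what I mean by the map/filter version, I imagine something like this (a sketch of what I'm asking about, not something I've benchmarked). Each map stage applies a check lazily and filter drops the Nones between stages, so no intermediate list is materialized; I can see where map and filter fit, but I'm less sure where reduce would come in, unless it's for composing the check functions:

```python
import re

pattern1 = re.compile(r'\d.*?[A-Z].*?[a-z]')
vocab = ['dog', 'lazy', 'the', 'fly']

def check_no_caps(s):
    return None if pattern1.match(s) else s

def check_nomorethan_five(s):
    return s if len(s) <= 5 else None

def check_in_vocab_plus_x(s, x):
    return None if s not in vocab else s + x

def not_none(s):
    return s is not None

slist = ['the', 'dog', 'jumps', 'over', 'the', 'fly']

# Each stage is a lazy iterator; nothing runs until the final list().
stage1 = map(check_no_caps, slist)
stage2 = map(check_nomorethan_five, filter(not_none, stage1))
stage3 = (check_in_vocab_plus_x(s, str(i))
          for i, s in enumerate(stage2) if s is not None)
result = list(stage3)
print(result)  # ['the0', 'dog1', None, None, 'the4', 'fly5']
```

This makes only one pass over the data, but each element still goes through the same Python-level function calls, so I'm not sure it would actually be faster than the list-comprehension chain.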