This snippet:
woord not in nlwoorden
takes O(N) time, with N = len(nlwoorden), each time it is evaluated.
So your list comprehension,
belangrijk=[woord for woord in tekst.split() if woord not in nlwoorden]
takes O(N * M) time in total, with M = len(tekst.split()).
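You can measure the difference directly. Here is a rough micro-benchmark sketch; the word list below is synthetic, purely for illustration:

import timeit

setup = "nlwoorden = ['woord%d' % i for i in range(100000)]; nlset = set(nlwoorden)"

# 'missing' is not in the collection, so the list test scans all 100000 entries
print(timeit.timeit("'missing' in nlwoorden", setup=setup, number=1000))
# the set test hashes the string once and looks it up directly
print(timeit.timeit("'missing' in nlset", setup=setup, number=1000))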
This is because nlwoorden
is a list, not a set. Testing membership in an unsorted list means comparing the word against the elements one by one, so in the worst case the whole list is traversed.
That is why the comprehension takes so long on a large input.
If you use a hash set instead, each membership test takes constant time on average once the set has been constructed.
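With that in mind, the smallest change to your own code is to build the set once, up front, and test against that (a sketch reusing your variable names, assuming nlwoorden and tekst are already filled in):

nlwoorden_set = set(nlwoorden)   # built once - O(N)
belangrijk = [woord for woord in tekst.split()
              if woord not in nlwoorden_set]   # each test is O(1) on average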
So, in fuller prototypical form, reading the whitelist from a file and streaming the input word by word, something like this:
import io

def words(fileobj):
    for line in fileobj:  # takes care of buffering large files, chunks at a time
        for word in line.split():
            yield word

# first, build the set of whitelisted words
wpath = 'whitelist.txt'
wset = set()
with io.open(wpath, mode='rb') as w:
    for word in words(w):
        wset.add(word)

def emit(word):
    # output 'word' - to a list, to another file, to a pipe, etc
    print word

fpath = 'input.txt'
with io.open(fpath, mode='rb') as f:
    for word in words(f):       # total run time - O(M) where M = len(words(f))
        if word not in wset:    # testing for membership in a hash set - O(1)
            emit(word)
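Two design points worth noting: words() is a generator, so neither file is read into memory all at once, and the whitelist set is built exactly once before the input is scanned, giving a total cost of O(N) for the set construction plus O(M) for the scan. (The print word line is Python 2 syntax; under Python 3 it would be print(word), and you would normally open both files in text mode rather than 'rb'.)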