You've got a few issues, though only the first is really nasty:
- (By far the biggest cost, in all likelihood) Checking for membership in a `list` is O(n); for a 70K element list, that's a lot of work on every lookup. Make it a `set`/`frozenset` and each lookup is typically O(1), saving thousands of comparisons per line. If the elements are unhashable, you can pre-sort the `selected_list` and use the `bisect` module to do lookups in O(log n) time, which would still be a multiple-orders-of-magnitude speedup for a list this large (there's a sketch of that fallback after the combined example below).
- If your lines are long, with several runs of whitespace, splitting at every run wastes time; pass `maxsplit` so each line is only split enough to extract the ID.
- If the IDs are always integer values, it may be worth making `selected_id` store `int` instead of `str` and converting each ID on read, so the hashing and lookup comparisons run a little faster (this would require benchmarking). It probably won't make a major difference, so I've left it out of the combined example; a sketch appears at the end.
Combining the first two suggestions:
```python
selected_id = frozenset(...)  # ... your original list of 70K+ str elements ...

for line in in_fp:  # input file: 10M+ lines
    # Split at most once; the ID is the first whitespace-separated field,
    # e.g. '10000872820081804'. Indexing [0] also tolerates ID-only lines
    # and avoids shadowing the builtin `id`
    line_id = line.split(None, 1)[0]
    if line_id in selected_id:
        out_fp.write(line)
```
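If you're ever in the unhashable-elements situation, a minimal sketch of the `bisect` fallback might look like this (assuming `selected_list` is the original list and its elements are orderable; the `in_selected` helper name is illustrative):

```python
import bisect

selected_sorted = sorted(selected_list)  # one-time O(n log n) sort

def in_selected(key):
    """True if key is in selected_sorted, found by O(log n) binary search."""
    i = bisect.bisect_left(selected_sorted, key)
    return i < len(selected_sorted) and selected_sorted[i] == key

for line in in_fp:
    if in_selected(line.split(None, 1)[0]):
        out_fp.write(line)
```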
Back in the `frozenset` version, you could even convert the `for` loop to a single call with a generator expression (though it gets a little overly compact), which pushes more of the work to the C layer in CPython and reduces Python bytecode execution overhead:

```python
out_fp.writelines(x for x in in_fp if x.split(None, 1)[0] in selected_id)
```
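And if you do want to benchmark the `int` key idea from the last bullet, the change is small; a sketch, again assuming `selected_list` holds the original `str` IDs:

```python
selected_id = frozenset(map(int, selected_list))  # store int keys instead of str

for line in in_fp:
    # int() must now run once per line, so only a benchmark can say whether
    # the cheaper hashing/comparisons pay for the conversion cost
    if int(line.split(None, 1)[0]) in selected_id:
        out_fp.write(line)
```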