If you can't fit even the smaller file into memory, Python is not going to help. The usual solution is to sort both inputs and then run a merge which operates on just a few entries at a time: it reads one entry from each file and, based on their sort order, decides which file to read from next. It only ever needs to keep two or three entries in memory to decide which branch to take.
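That merge can be sketched directly in shell with two file descriptors. This is just an illustration of the idea, not something from the original answer; the function name is made up, and it assumes plain sorted text lines (lines starting with `-` could confuse `expr`):

```shell
# merge_intersect FILE1 FILE2 - print lines common to two sorted files,
# holding only the current line from each file in memory at once.
merge_intersect() {
  exec 3< "$1" 4< "$2"
  IFS= read -r a <&3 && IFS= read -r b <&4 || { exec 3<&- 4<&-; return 0; }
  while :; do
    if [ "$a" = "$b" ]; then
      printf '%s\n' "$a"                      # common line: emit, advance both
      IFS= read -r a <&3 || break
      IFS= read -r b <&4 || break
    elif [ "$(expr "$a" \< "$b")" = 1 ]; then
      IFS= read -r a <&3 || break             # a sorts first: advance file 1
    else
      IFS= read -r b <&4 || break             # b sorts first: advance file 2
    fi
  done
  exec 3<&- 4<&-
}
```

This is essentially what comm does internally, so in practice you would just use comm; the sketch only shows why sorted inputs make the problem streamable.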
GNU sort will fall back to a disk-based merge sort if the data doesn't fit into memory, so it is basically restricted only by available temporary disk space.
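You can steer that behavior explicitly; a small sketch, assuming GNU coreutils sort (`-S` caps the in-memory buffer, `-T` picks where the spill chunks go; the input file here is made up):

```shell
# Demo of GNU sort's spill-to-disk knobs on a tiny stand-in input:
# -S caps the memory buffer, -T chooses the temp directory for spill files.
printf 'pear\napple\nmango\n' > /tmp/fruit.txt
sort -S 1M -T "${TMPDIR:-/tmp}" /tmp/fruit.txt   # apple, mango, pear
rm -f /tmp/fruit.txt
```

Pointing `-T` at a filesystem with plenty of free space matters more than `-S` here, since the spill files can approach the size of the input.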
#!/bin/sh
export LC_ALL=C  # use trad POSIX sort order
t=$(mktemp -t listA.XXXXXXXX) || exit 123
trap 'rm -f "$t"' EXIT HUP INT
sort listA.txt >"$t"               # sort the first list into a temp file
sort listB.txt | comm -12 "$t" -   # print only the lines common to both
If the input files are already sorted, obviously comm is all you need.
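For reference, with no flags comm prints three columns (lines only in the first file, lines only in the second, lines in both), and `-12` suppresses the first two, leaving just the common lines. A quick demo with made-up file names:

```shell
# comm requires both inputs sorted under the same collation.
printf 'a\nb\nc\n' > /tmp/A.sorted
printf 'b\nc\nd\n' > /tmp/B.sorted
comm -12 /tmp/A.sorted /tmp/B.sorted   # prints: b, c
rm -f /tmp/A.sorted /tmp/B.sorted
```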
Bash (and I guess probably also Zsh and ksh) offers process substitution, as in comm <(sort listA.txt) <(sort listB.txt), but I'm not sure if that's robust under memory exhaustion.
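If you do go that route, it's worth pinning the collation so sort and comm agree, just as the script above does. A sketch, wrapped in `bash -c` only because plain sh lacks process substitution (file names made up):

```shell
# One-liner variant with a fixed collation; process substitution
# needs bash/zsh/ksh, so it is invoked via bash here.
printf 'banana\napple\n' > /tmp/demoA.txt
printf 'cherry\napple\n' > /tmp/demoB.txt
bash -c 'LC_ALL=C comm -12 <(LC_ALL=C sort /tmp/demoA.txt) \
                           <(LC_ALL=C sort /tmp/demoB.txt)'   # apple
rm -f /tmp/demoA.txt /tmp/demoB.txt
```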
As I'm sure you have already discovered, if the files are of radically different sizes, it makes sense to keep the smaller one in memory regardless of your approach (so switch the order of listA.txt and listB.txt if listB.txt is the smaller one, both here and in your original grep command line; though I guess it will make less of a difference here).
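For instance, assuming the original command was a `grep -f` style lookup (the `-F`/`-x` flags are a guess at what whole-line, fixed-string matching would need), the smaller file should be the pattern list:

```shell
# Feed the smaller file to grep as fixed-string (-F), whole-line (-x)
# patterns (-f), and stream the bigger file past it. Demo with made-up data:
printf 'alpha\nbeta\ngamma\n' > /tmp/big.txt
printf 'beta\n'               > /tmp/small.txt
grep -Fxf /tmp/small.txt /tmp/big.txt   # beta
rm -f /tmp/big.txt /tmp/small.txt
```

grep builds its match machinery from the pattern file, so its memory use scales with the smaller file while the bigger one is only streamed.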