awk -i inplace '!seen[$0]++' *
The above uses GNU awk 4.* for "inplace" editing. You'll need enough free space to make a copy of your largest file, plus enough memory to hold a list of every unique word. The above also assumes your "words" are newline-separated, since you didn't tell us anything otherwise.
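If your awk doesn't support GNU's -i inplace, a minimal sketch of the same idea via temporary files (one awk invocation across all the files, so seen[] spans them just like above; the .new suffix is an arbitrary choice):

awk '
    FNR == 1 { if (out) close(out); out = FILENAME ".new"; printf "" > out }  # one output file per input file
    !seen[$0]++ { print > out }                                               # keep only the first occurrence of each word
' *
for f in *.new; do mv -- "$f" "${f%.new}"; done

The printf "" just makes sure FILE.new gets created even when every line of a file was already seen elsewhere, so the mv loop replaces every file.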
If you don't have enough memory to copy your largest file, you could try something like:
for file in *
do
    while [ -s "$file" ]; do
        # copy the first 100 lines from "$file" into tmp
        head -n 100 "$file" > tmp
        # inplace remove the first 100 lines from "$file"
        count=$(head -n 100 "$file" | wc -c)
        dd if="$file" bs="$count" skip=1 conv=notrunc of="$file"
        truncate -s "-$count" "$file"
        # somehow get a subset of words to check in tmp
        awk 'magic happens' tmp >> "${file}.new" &&
        rm -f tmp
    done
    # "$file" is empty now, so replace it with the deduplicated version
    mv -- "${file}.new" "$file"
done
but you'll have to figure out how to come up with groups of words to check at a time (e.g. see below). This will be slow, so tread carefully and make a backup of your files first!
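For example, if you can spare the disk space, one memory-light way to fill in the "magic happens" step is to keep the words seen so far sorted in a file and compare with comm, which streams its inputs instead of holding them in memory. A sketch (seen.sorted is a hypothetical file you'd create empty before the loop; note this does not preserve the order of words within each 100-line chunk):

# words in tmp that weren't seen in any earlier chunk or file
# (comm -13 prints only the lines unique to the second file)
sort -u tmp > tmp.sorted
comm -13 seen.sorted tmp.sorted >> "${file}.new"
# merge the new words into the seen list (sort -m streams, too)
sort -m -u seen.sorted tmp.sorted > seen.next && mv seen.next seen.sorted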
If you CAN make a copy of each file but can't fit all of the "words" in memory at one time, then you could do something like:
for a in {a..z}
do
awk -v start="^$a" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
done
to look for groups of words based on some characteristic, e.g. the above looks for all words that start with "a", then with "b", etc. (None of these passes touches a word that doesn't start with a letter; see the catch-all sketch at the end.) If those batches are too big, add an inner loop:
for a in {a..z}
do
awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
for b in {a..z}
do
awk -v start="^$a$b" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
done
done
or more (note how the regexp pattern expands):
for a in {a..z}
do
awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
for b in {a..z}
do
awk -v start="^$a$b$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
for c in {a..z}
do
awk -v start="^$a$b$c" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
done
done
done
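As mentioned above, none of these passes examines a word that doesn't start with a letter, so words beginning with digits, punctuation, etc. never get deduplicated. If that matters, a final catch-all pass might look like this (a sketch; ^[^[:alpha:]] matches any line whose first character isn't a letter):

awk -v start='^[^[:alpha:]]' -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *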
The more nested the loops, the fewer words each pass keeps in memory at a time, but the slower it'll execute.
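Whichever variant you use, a quick sanity check afterwards might be (assuming the "words" really are one per line; no output means no case-insensitive duplicates remain):

sort -f * | uniq -di | head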