I have a massive file in which every line is unique. I have a collection of smaller files (still relatively large) whose lines are not unique, and this collection is constantly growing. I need to merge the small files into the big file while making sure the big file ends up with no duplicates. Right now I concatenate all the files into one and then run sort -u on it. However, this rescans the entire big file every time, which takes longer and longer as more files come in and seems inefficient. Is there a better way to do this?
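For reference, this is roughly what I run today (file names are hypothetical):

cat small1.txt small2.txt small3.txt >> bigfile.txt
sort -u bigfile.txt -o bigfile.txt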
1 Answer
If the big file is already sorted, it would be more efficient to sort -u only the smaller files, and then sort -u -m (merge) the result with the big file. -m assumes the inputs are already individually sorted.
Example (untested):
#!/bin/bash
# Merges unique lines in the files passed as arguments into BIGFILE.
BIGFILE=bigfile.txt
TMPFILE=$(mktemp)
trap 'rm "$TMPFILE"' EXIT                    # clean up the temp file on exit
# Sort and de-duplicate only the new (small) files.
sort -u "$@" > "$TMPFILE"
# Merge the two already-sorted inputs; -o writes back into the big file.
sort -um "$TMPFILE" "$BIGFILE" -o "$BIGFILE"
This answer explains why -o is necessary.
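For illustration, with a hypothetical new.txt that is already sorted and de-duplicated (the first line is what not to do):

sort -um new.txt bigfile.txt > bigfile.txt    # wrong: the shell truncates bigfile.txt before sort reads it
sort -um new.txt bigfile.txt -o bigfile.txt   # safe: sort reads all its input before opening the -o file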
If you like process substitution, you can even do it in a one-liner:
sort -um <(sort -u "$@") "$BIGFILE" -o "$BIGFILE"

dimo414