
Got a hard one: I have 1/2 TB of text files in a folder. I want to keep the text file names and not merge everything into one file.

How can I go through a text file and compare each line against all the lines in the rest of the files?

Removing all the duplicate words across the entire directory, and so on until all files are done? Some of the files are large, 38 GB.

e.g.

textfile1.txt has the duplicate word power

textfile2.txt also has this word power and it needs to be removed there, etc.

Edit: all words are newline separated.

Until finished with all the files in that same dir. Either on Linux or Windows.

Hopelessone
  • Do you have one list file that you plan on checking against every other file or are you planning on checking every file against every other file? – James Brown Aug 25 '16 at 12:47
  • ok, so use the first file to check against the other 500; once finished checking and removing all dupes, then begin the second file and do the same until all done. – Hopelessone Aug 25 '16 at 13:17

1 Answer

awk -i inplace '!seen[$0]++' *

The above uses GNU awk 4.* for "inplace" editing. You'll need enough memory to make a copy of your largest file and to keep a list of all unique words in memory. The above also assumes your "words" are newline-separated, since you didn't tell us otherwise.
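
For example, given two small files with these hypothetical contents (one word per line, and assuming they're the only files in the directory), a single run keeps only the first occurrence of each word across the whole set, processing the files in glob order:

$ cat textfile1.txt
power
light
power
$ cat textfile2.txt
power
dark
$ awk -i inplace '!seen[$0]++' *
$ cat textfile1.txt
power
light
$ cat textfile2.txt
dark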

If you don't have enough memory to copy your largest file, you could try something like:

for file in *
do
    while [ -s "$file" ]; do
        # copy the first 100 lines from "$file" into tmp
        head -n 100 "$file" > tmp

        # inplace remove the first 100 lines from "$file"
        # (conv=notrunc stops dd from truncating the file before it reads it)
        count=$(head -n 100 "$file" | wc -c)
        dd if="$file" bs="$count" skip=1 conv=notrunc of="$file"
        truncate -s "-$count" "$file"

        # somehow get a subset of words to check in tmp
        awk 'magic happens' tmp >> "${file}.new" &&
        rm -f tmp
    done
done

but you'll have to figure out how to come up with groups of words to check at a time (e.g. see below). This will be slow, so tread carefully and make a backup of your files first!
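
As one possible sketch of that "magic happens" step (untested at this scale, and seen.sorted, chunk.sorted and new_words are made-up names): keep an on-disk, sorted list of every word already written out, so memory only ever has to hold one 100-line chunk at a time:

# once, before the loops: start with an empty, sorted list of kept words
: > seen.sorted

# inside the while loop, in place of the awk 'magic happens' line:
sort -u tmp > chunk.sorted                      # unique words in this chunk, sorted
comm -23 chunk.sorted seen.sorted > new_words   # only the words never kept before
cat new_words >> "${file}.new"                  # keep those for this file
sort -m seen.sorted new_words > seen.tmp &&     # merge them into the seen list
mv seen.tmp seen.sorted
rm -f chunk.sorted new_words

Note this writes each chunk's words out in sorted order rather than their original order, and repeatedly re-reading seen.sorted is a big part of why it's slow.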

If you CAN make a copy of each file but can't fit all of the "words" in memory at one time then you could do something like:

for a in {a..z}
do
   awk -v start="^$a" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
done

to look for groups of words based on some characteristic, e.g. the above looks for all words that start with a, then with b, etc. If those batches are too big, add an inner loop (note the outer pattern then becomes ^$a$, so it only handles the single-letter word itself, while the inner loop handles each two-letter prefix):

for a in {a..z}
do
   awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
   for b in {a..z}
   do
       awk -v start="^$a$b" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
   done
done

or more (to show the expanding regexp pattern):

for a in {a..z}
do
   awk -v start="^$a$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
   for b in {a..z}
   do
       awk -v start="^$a$b$" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
       for c in {a..z}
       do
           awk -v start="^$a$b$c" -i inplace -v IGNORECASE=1 '!($0~start && seen[$0]++)' *
       done
   done
done

The more nested loops you add, the fewer words it'll process at a time and the slower it'll execute.

Ed Morton
  • Can't; the largest file is 100 GB. – Hopelessone Aug 25 '16 at 13:18
  • The size of any given file doesn't stop you from doing it, the amount of memory you have available stops you. If you don't have enough memory available to make a copy of your largest file, you are screwed as you can't **really** edit a file "in place" in UNIX - all tools claiming to do so (awk, sed, perl, etc.) actually make a copy of the file internally. To truly edit a file in place you need to do something like I show at http://stackoverflow.com/a/17331179/1745001 and good luck with that for this application! – Ed Morton Aug 25 '16 at 13:24
  • what is this guy saying in the first answer here? http://stackoverflow.com/questions/32048401/python-removing-dupes-from-large-text-file – Hopelessone Aug 25 '16 at 13:29
  • He's saying you need to be able to make a copy of your largest file. All that script does is reproduce the input file in an output file, minus any duplicate lines, so worst case there are no duplicate lines and you end up with 2 identical files. It's like my answer above but it'd only work for a single input file. – Ed Morton Aug 25 '16 at 13:31
  • I've got 4 GB of memory – Hopelessone Aug 25 '16 at 13:32
  • oh you mean hdd space? yep plenty – Hopelessone Aug 25 '16 at 13:33
  • Sorry, I have no idea what hdd space is. It's a simple thing - can you make a copy of your largest file or not? If you can, can you also save all "words" in memory (whatever that means in your OS)? If the answer to both is "yes" then you can do what you want as I show. If the answer to the first is "yes" but the 2nd "no" then there may be something you could do to analyze smaller batches of words at a time in a loop. If the answer to the first is "no" then there may be something you could do where you analyze each file in blocks and every time you cut/paste the block to be analyzed. – Ed Morton Aug 25 '16 at 13:38
  • hdd = hard drive space. I think "If the answer to the first is "yes" but the 2nd "no" then there may be something you could do to analyze smaller batches of words at a time in a loop." is what i'm after. as I have plenty of hard drive space but only 4GB of memory. – Hopelessone Aug 25 '16 at 13:41
  • OK, I edited my answer to provide a couple of options. – Ed Morton Aug 25 '16 at 13:54