Scalable way of deleting all lines from a file where the line starts with one of many values

Question

Given an input file of variable values (example):

A
B
D

What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:

A
B
C
D

Would end up being:

C

The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.

Ed Morton · Answer 1 · 2013-07-11T15:12:53.770

3

awk '

    NR==FNR {     # IF this is the first file in the arg list THEN
        list[$0]  #     store the contents of the current record as an index of array "list"
        next      #     skip the rest of the script and so move on to the next input record
    }             # ENDIF

    {                                # This MUST be the second file in the arg list
        for (i in list)              # FOR each index "i" in array "list" DO
            if (index($0,i) == 1)    #     IF "i" starts at the 1st char on the current record THEN
                next                 #         move on to the next input record
    }

    1  # Specify a true condition and so invoke the default action of printing the current record.

' file1 file2

An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:

...
list = list "|" $0
...

and then doing an RE comparison:

...
if ($0 ~ list)
    next
...

but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
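As a rough sketch of the RE approach with the metacharacter problem handled, the following builds one anchored alternation from file1 and backslash-escapes ERE metacharacters in each value first. The `esc()` helper, the temporary directory and the sample data are illustrative assumptions, not part of the original answer. Whether an alternation with around 100,000 branches is fast, or even accepted, depends on the awk implementation, so this is something to benchmark rather than a recommendation.

```shell
#!/bin/sh
# Hypothetical sample data, only to make the sketch runnable.
dir=$(mktemp -d)
printf '%s\n' 'A' 'a.b'                   > "$dir/file1"
printf '%s\n' 'Apple' 'a.b=1' 'axb=2' 'C' > "$dir/file2"

result=$(awk '
    # esc() is a hypothetical helper: put a backslash in front of every
    # ERE metacharacter so that the value is matched literally.
    function esc(s,    i, c, out) {
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (index(".[]\\^$*+?(){}|", c))
                out = out "\\"
            out = out c
        }
        return out
    }

    NR==FNR {                      # first file: collect the escaped values
        if ($0 != "")
            re = (re == "" ? "" : re "|") esc($0)
        next
    }

    FNR==1 { re = "^(" re ")" }    # second file, first line: anchor the alternation once

    $0 !~ re                       # print lines that do not start with any value
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

With this sample data, `axb=2` is kept because the `.` in `a.b` is escaped and therefore only matches a literal dot.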

If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:

awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
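The first script above compares every value from file1 against every line of file2, which is a lot of work for roughly 100,000 values and several million lines. One possible variation, not from the original answer, is to record the distinct lengths of the values and do one hash lookup per length on the leading characters of each line. This is a minimal sketch under that assumption; the `lens` array and the sample data are made up for illustration, and the gain depends on how few distinct lengths the values have.

```shell
#!/bin/sh
# Hypothetical sample data, only to make the sketch runnable.
dir=$(mktemp -d)
printf '%s\n' 'A' 'Bx' 'D'                            > "$dir/file1"
printf '%s\n' 'Asomething' 'B1324' 'Bxyz' 'C23sd' 'D' > "$dir/file2"

result=$(awk '
    NR==FNR {                      # first file: store each value and its length
        if ($0 != "") {
            list[$0]
            lens[length($0)]
        }
        next
    }
    {                              # second file
        for (n in lens)            # one hash lookup per distinct value length
            if (substr($0, 1, n) in list)
                next               # the line starts with a stored value: drop it
        print
    }
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Because the lookup is an exact string comparison on a prefix, RE metacharacters in file1 need no special handling here.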

edited Jul 11 '13 at 15:12

answered Jul 11 '13 at 14:10

Ed Morton


Hi @EdMorton This definitely works for OP's question and you have my +1 for that. But I have an unrelated question and sorry to be posting it in comments. The index function will remove partial matches. For example, if list hash has a key = month it will remove month, months, monthly etc and not just month. Is there a way to use this function and remove exact matches? I could do `awk 'NR==FNR{list[$0];next}!($1 in list)' f2 f1` but can it be done using `index` function? – jaypal singh Jul 12 '13 at 04:23

Good question! `index()` doesn't remove anything, it just finds the starting position of a sub-string in a string. There is no way to tell `index()` to find `the` when not in `then`, for example, since the ability to do that would require a regular expression and there are other functions for that, eg `match()`. If you know the exact space separator then you can include that in the string you're searching for, e.g. `index($0,"the ")`, but that's all. Other than that you need to use REs and `match()` or `/.../`. – Ed Morton Jul 12 '13 at 12:47
Going back to your `$1 in list` example, you COULD do `var="the"; index($1,var) && (length(var) == length($1))` to check for an exact string comparison, but then you'd be better off with simply `var == $1""` or even `$1 == "the"`. – Ed Morton Jul 12 '13 at 13:32

higuaro · Answer 2 · 2013-07-11T14:30:03.373

1

You can use comm to display the lines that are not common to both files, like this:

comm -3 file1 file2

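Will print the following for the sample files in the question (the line unique to file2 appears in the second column, indented by a tab):

	C
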
Note that for this to work, both files have to be sorted; if they aren't, you can work around that using:

comm -3 <(sort file1) <(sort file2)
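One thing worth checking before relying on this: `comm` compares whole lines, so it covers the case where each value in file1 is a complete line of file2, not the case where it is only a prefix. The sketch below illustrates that with made-up sample data; the file names and the use of `LC_ALL=C` (so that `sort` and `comm` agree on ordering) are assumptions for the example, and temporary files replace the process substitution so that it runs in a plain POSIX shell.

```shell
#!/bin/sh
# Hypothetical sample data: file1 holds prefixes, file2 holds longer lines.
dir=$(mktemp -d)
printf '%s\n' 'A' 'B'              > "$dir/file1"
printf '%s\n' 'Asomething' 'C23sd' > "$dir/file2"

# Sort in the C locale so that sort and comm use the same collation order.
LC_ALL=C sort "$dir/file1" > "$dir/file1.sorted"
LC_ALL=C sort "$dir/file2" > "$dir/file2.sorted"

# -3 suppresses lines common to both files; no line is common here,
# so every line of both files is printed.
result=$(LC_ALL=C comm -3 "$dir/file1.sorted" "$dir/file2.sorted")

printf '%s\n' "$result"
rm -rf "$dir"
```

Here `Asomething` still appears in the output (in the tab-indented second column) even though it starts with `A`, and the values `A` and `B` from file1 are printed as well.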

edited Jul 11 '13 at 14:30

answered Jul 11 '13 at 14:23

higuaro


You can add `--parallel=N` to the sort command so it will run N sorts concurrently, or you can preprocess the files by splitting them into several chunks and running the sorts in parallel processes – higuaro Jul 11 '13 at 14:53
There are some proposals to speed up the sorting process in the following thread http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file – higuaro Jul 11 '13 at 14:56

score 1 · Answer 3 · answered Jul 11 '13 at 14:28

You can also achieve this using egrep:

egrep -vf <(sed 's/^/^/' file1) file2

Let's see it in action:

$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB

This would remove lines that start with one of the values in file1.

answered Jul 11 '13 at 14:28

devnull


Not sure this will scale to handle the OP's 100000 record input file – iruvar Jul 11 '13 at 14:34
