exclude regular expression and process very large files

Question

I have a text file that I need to correct. Every line containing a word listed in the file "exclude.txt" should be removed from the original text.

original.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

The exclude file looks like this...

exclude.txt
tart
wrok

The expected output will look like this...

final.txt
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

This grep command is working as expected.

grep -v -E 'tart|wrok' original.txt

This is OK if I have only 2 or 3 words in the exclude file. The problem is that both the original and the exclude files contain millions of words.

Update:

I forgot to mention that I have this line in original.txt

<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>

I want to keep this line in the original file: even though the wrong word "tart" appears in it, it is not in "block-list:name".

Update:

The include file has 15 million words, while the exclude file has only 15 thousand.

include.txt
test
work
table
total
exit

The awk and grep + sed commands get killed. I would prefer to use the include file instead of the exclude file (if possible).

See: [https://stackoverflow.com/q/19380925/3776858](https://stackoverflow.com/q/19380925/3776858) — Cyrus, Apr 09 '21 at 15:47
To not remove lines that contain strings such as `start`. See: `man grep` — Cyrus, Apr 09 '21 at 15:52
I'd suggest to use [ripgrep](https://github.com/BurntSushi/ripgrep) or tools like [hyperscan](https://github.com/intel/hyperscan) for better performance.. — Sundeep, Apr 09 '21 at 15:54
More than expected lines are getting excluded. I will check again and get back with -f option. — shantanuo, Apr 09 '21 at 15:58
awk solution is better than grep because I want to compare words only in "block-list:name". The placeholder is important. — shantanuo, Apr 09 '21 at 16:17

anubhava · Answer 1 · 2021-04-09T17:03:26.063

You may use this grep + sed solution in bash:

grep -vFf <(sed 's/.*/block-list:name="&"/' exclude.txt) original.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

`sed 's/.*/block-list:name="&"/' exclude.txt` wraps each word in exclude.txt as `block-list:name="<word>"`.
`grep -vFf` prints the lines of original.txt that match none of those fixed-string patterns; the patterns come from a process substitution `<(...)` that runs the `sed` command.

PS: Based on the edited question, this solution only removes lines that contain block-list:name="blocked-word" in the original file.
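As a quick check against the sample data from the question, here is a minimal sketch of the same approach. It writes the generated patterns to a file instead of using process substitution, so it does not depend on bash; the scratch directory and the file name patterns.txt are assumptions made for illustration only.

```shell
# Work in a scratch directory (hypothetical location, for illustration only).
dir=$(mktemp -d)
cd "$dir" || exit 1

# Sample lines from the question, plus the line from the first update.
cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
EOF
printf '%s\n' tart wrok > exclude.txt

# Wrap every excluded word as block-list:name="word".
sed 's/.*/block-list:name="&"/' exclude.txt > patterns.txt

# Drop the lines that contain any of those fixed strings.
grep -vFf patterns.txt original.txt > final.txt
cat final.txt
```

The line with abbreviated-name="tart" should survive, because the generated pattern only matches the word when it follows block-list:name=.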

This solution is better than awk because non-standard lines are preserved. And I do not need to learn awk to use this command. :) — shantanuo, Apr 10 '21 at 01:15
This works for exclude list. But if I replace the file with include list and remove -v flag, the command gets killed. This is because the include file has 15 million lines while exclude file has only 15 thousand. — shantanuo, Apr 10 '21 at 02:38

James Brown · Accepted Answer · 2021-04-09T16:50:41.173

Using awk with " as the field delimiter, every even-numbered field is a quoted value (blabla"word"blablabla"another_word"...), so in the sample lines $4 is the value of block-list:name:

$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original

Output:

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
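To see why the fourth field is the one to test, this small sketch prints the fields awk produces for one sample line when " is the separator. The variable names line and fields are only for illustration.

```shell
# One sample line from the question.
line='<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>'

# Split on double quotes and show the two quoted values.
fields=$(printf '%s\n' "$line" | awk -F\" '{ print "$2=" $2; print "$4=" $4 }')
printf '%s\n' "$fields"
# prints:
# $2=tast
# $4=tart
```

This only holds while block-list:name is the second quoted attribute on the line.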

Edit: I just noticed this in the comments: "I want to compare words only in "block-list:name". The placeholder is important." So I changed `!($2 in a)&&!($4 in a)` to `!($4 in a)`. If the position of block-list:name varies, use:

$ awk '
NR==FNR {                             # process the exclude file
    a[$1]                             # hash word
    next
}
{                                     # process the original file
    for(i=1;i<=NF;i++)                # loop over every space-separated string
        if($i~/^block-list:name=/) {  # when we meet the desired string
            t=$i                      # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t)  # extract the word
            if(!(t in a))             # if the word is not to be excluded
                print                 # output record
            next                      # move to the next record anyway
        }
}' exclude original

Is it possible to use include file (15 million lines) instead of exclude file (15 thousand)? — shantanuo, Apr 10 '21 at 02:39
I changed if(!(t in a)) to if((t in a)) while using include file instead of exclude. It worked as expected. — shantanuo, Apr 10 '21 at 04:27
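The change described in the last comment can be sketched on a few sample lines as follows. This is the answer's script with only the condition flipped; the scratch directory and the output name kept.txt are assumptions for illustration. The word list is held in an awk array, so memory use grows with the size of the include file.

```shell
# Work in a scratch directory (hypothetical location, for illustration only).
dir=$(mktemp -d)
cd "$dir" || exit 1

cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
EOF
printf '%s\n' test work table total exit > include.txt

awk '
NR==FNR { a[$1]; next }                 # first file: hash the allowed words
{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^block-list:name=/) { # the attribute we care about
            t = $i
            gsub(/^[^"]+"|".*/, "", t)  # keep only the quoted word
            if (t in a)                 # flipped: keep words on the include list
                print
            next
        }
}' include.txt original.txt > kept.txt
cat kept.txt
```

Only the two lines whose block-list:name is "test" should remain.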
