
I would like to be able to use piped input or a reference file of domains (file B) to remove each domain and its subdomains from file A.

I can't use grep "bbc.co.uk", for example, as this would include entries such as cbbc.co.uk.
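To illustrate the problem (a plain, unanchored pattern matches as a substring, so the parent domain catches unrelated domains too):

```shell
printf 'bbc.co.uk\ncbbc.co.uk\n' | grep 'bbc.co.uk'
# prints both bbc.co.uk and cbbc.co.uk
```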

I have tried using a while read loop to iterate through file B, running grep -E "^([^.\s]+\.)*${escaped_domain}$" fileA to identify both domains and subdomains, but this is very, very slow given the number of comparisons required.

Is there a better way to do this? Perhaps using awk?

File B (or piped input)

~30k lines

bbc.co.uk
amazon.co.uk
doubleclick.net

File A

~150k+ lines

123123.test.bbc.co.uk
123434.rwr.amazon.co.uk
ads.bbc.co.uk
adsa.23432.doubleclick.net
amazon.co.uk
bbc.co.uk
cbbc.co.uk
damazon.co.uk
fsdfsfs.doubleclick.net
test.amazon.co.uk
test.bbc.co.uk
test.damazon.co.uk

Desired output:

cbbc.co.uk
damazon.co.uk
test.damazon.co.uk

Current method (different input with grep/regexps)

# Convert input: address=/test.com/ -> ^([^.\s]+\.)*test\.com$
regexList=$(sed 's/\./\\./g' fileB |
    awk -F '/' '{print "^([^.\\s]+\\.)*" $2 "$"}')

while read -r regex; do
    grep -E "$regex" fileA
done <<< "$regexList"
mmotti
  • *I can't use grep "bbc.co.uk", for example, as this would include entries such as cbbc.co.uk* You can use `grep -Eo "\bbbc.co.uk\b"` – Paolo Aug 18 '18 at 11:30
  • I have tried using word boundaries before. Unfortunately this matches things like `my-bbc.co.uk` making it a little over sensitive. – mmotti Aug 18 '18 at 11:44
  • @RavinderSingh13 the output should not include those domains as `doubleclick.net` is in `file B`. `adsa.23432.doubleclick.net` is a subdomain of `doubleclick.net`, so should not be outputted. – mmotti Aug 18 '18 at 11:55
  • We are easily confused. Naming your first input file B and your second one A is unnecessary obfuscation. It's hard enough to figure out what someone needs to do and how to help them without also having to remember that the **first** input file in the question is named `fileB`! Don't change it now of course or it'll make things worse but next time.... – Ed Morton Aug 18 '18 at 12:59

2 Answers

$ awk '
    NR==FNR {
        gsub(/[^^]/,"[&]")
        gsub(/\^/,"\\^")
        doms["(^|[.])"$0"$"]
        next
    }
    {
        for (dom in doms) {
            if ($0 ~ dom) {
                next
            }
        }
        print
    }
' fileB fileA
cbbc.co.uk
damazon.co.uk
test.damazon.co.uk

or with a pipe:

$ cat fileB | awk '...' - fileA

If fileB is small enough then you don't need an array you can just build up and test 1 regexp for all domains:

$ awk '
    NR==1 { doms = "(^|[.])(" $0; next }
    NR==FNR {
        gsub(/[^^]/,"[&]")
        gsub(/\^/,"\\^")
        doms = doms "|" $0
        next
    }
    FNR==1 { doms = doms ")$" }
    $0 !~ doms
' fileB fileA
cbbc.co.uk
damazon.co.uk
test.damazon.co.uk

The two gsub()s in each script ensure that all regexp metacharacters in the domains are treated as literal characters instead. See is-it-possible-to-escape-regex-metacharacters-reliably-with-sed for details on why and how that works.
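For example, here is what that escaping produces for a single domain (just the transformation step, run standalone outside the full script):

```shell
echo 'bbc.co.uk' | awk '{ gsub(/[^^]/,"[&]"); gsub(/\^/,"\\^"); print "(^|[.])" $0 "$" }'
# -> (^|[.])[b][b][c][.][c][o][.][u][k]$
```

Every character except `^` ends up inside a bracket expression (where it is literal), and any `^` is backslash-escaped, so the dots in the domain can no longer match arbitrary characters.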

Ed Morton
  • Thanks - I am giving it a try; it does seem to take a long while before printing results in comparison with the use of `grep -E`. Is this to be expected when processing files of this size? – mmotti Aug 18 '18 at 14:17
  • Generally, no. The first one might take a BIT longer since it's looking through every domain in fileB one at a time for every line in fileA but it's looking for a simple regexp each iteration rather than looking for a complex regexp all at once so the time difference shouldn't be much, the second one shouldn't take any more time as it's doing exactly the same thing that grep -E does (with assumptions on my part about how you're using grep -E!). They should both print the FIRST result quickly either way. You sure you don't have it hung waiting for input? – Ed Morton Aug 18 '18 at 16:07
  • Definitely sure it's not waiting for input - It seems to be really slow with large input files. Is there any way that I can provide you with my `file A` and `file B` testing? I have included in the original post my current method; please note that the input file (B) for this method is in slightly different format to my desired format. – mmotti Aug 20 '18 at 10:58
  • Do you know it's not waiting for input because it does produce output or something else? No, I don't have time to do any more with it so I won't download files to test with, sorry. Have you tried your current shell-loop-with-greps and my awk script on a smaller file? I'd actually be shocked if that while-read-grep loop wasn't an order of magnitude **slower** than the awk script and it's fragile in several ways so I wouldn't be surprised if it isn't actually using all of your regexps and/or is otherwise producing incorrect output (see what @triplee **actually** suggested). – Ed Morton Aug 20 '18 at 12:22

You can transform the first file into a set of regular expressions for what to remove:

sed 's/[][\\.^$*+?()]/\\&/g;s/.*/^([^.]+\\.)*&$/' fileB

The output is a sequence of regular expressions you can pass to grep -vE:

... | grep -vEf - fileA

There are limits to how much grep -Ef can keep in memory in one go, but 30k expressions is probably within limits on modern hardware. In the worst case, split the expression list in half and run the input through two greps, one for each half of the expressions.
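Chaining the two commands above into a single pipeline gives:

```shell
sed 's/[][\\.^$*+?()]/\\&/g;s/.*/^([^.]+\\.)*&$/' fileB | grep -vEf - fileA
```

With the sample fileA and fileB from the question, this should print only cbbc.co.uk, damazon.co.uk and test.damazon.co.uk.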

tripleee
    @EdMorton Thanks, hadn't seen that before, placement of backslash in a character class had to be changed. I thought I was being extra careful by doubling the backslash but that wasn't helping at all. This might still not be entirely portable. – tripleee Aug 18 '18 at 12:43
  • Looks like it'd be portable to me. The only thing I wonder about is whether or not you can really just escape all of those chars. In another question I participated in about escaping regexp metachars we ended up putting all chars except `^` inside bracket expressions and only escaped `^` (see https://stackoverflow.com/q/29613304/1745001) but I don't remember under what conditions just escaping them all was inadequate. Maybe it was just because we weren't assuming EREs and so had to worry about accidentally enabling ERE metachars? idk... – Ed Morton Aug 18 '18 at 12:46