I would like to use piped input or a reference file of domains (file B) to remove each listed domain and its subdomains from file A.
I can't simply grep "bbc.co.uk", for example, as that would also match entries such as cbbc.co.uk.
I have tried a while read loop that iterates through file B, running grep -E "^([^.\s]+\.)*${escaped_domain}$" fileA
for each domain to match it and its subdomains, but with this many comparisons it is very, very slow.
Is there a better way to do this? Perhaps using awk?
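To illustrate the kind of single-pass approach I'm hoping for, here is a sketch of what I imagine an awk solution might look like (untested at scale, and I don't know if it is idiomatic): load file B into an array, then for each line of file A strip labels from the left until either a blocked suffix is found or the name is exhausted.

```shell
# Sketch: first file builds the blocklist; second file is filtered.
awk '
    NR == FNR { blocked[$0]; next }      # file B: record blocked domains
    {
        host = $0
        while (host != "") {
            if (host in blocked) next    # exact or parent-domain match: drop
            dot = index(host, ".")
            if (dot == 0) break          # no more labels to strip
            host = substr(host, dot + 1) # strip the leftmost label
        }
        print                            # nothing in the blocklist matched
    }' fileB fileA
```

Because stripping happens only at label boundaries, cbbc.co.uk is never reduced to bbc.co.uk, so it survives the filter.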
File B (or piped input), ~30k lines:
bbc.co.uk
amazon.co.uk
doubleclick.net
File A, ~150k+ lines:
123123.test.bbc.co.uk
123434.rwr.amazon.co.uk
ads.bbc.co.uk
adsa.23432.doubleclick.net
amazon.co.uk
bbc.co.uk
cbbc.co.uk
damazon.co.uk
fsdfsfs.doubleclick.net
test.amazon.co.uk
test.bbc.co.uk
test.damazon.co.uk
Desired output:
cbbc.co.uk
damazon.co.uk
test.damazon.co.uk
Current method (grep/regexes, taking input in a different format):
# Convert input: address=/test.com/ -> ^([^.\s]+\.)*test\.com$
regexList=$(sed 's/\./\\./g' fileB |
    awk -F '/' '{print "^([^.\\s]+\\.)*" $2 "$"}')

while read -r regex; do
    grep -E "$regex" fileA
done <<< "$regexList"
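As an aside, the per-domain loop can at least be collapsed into a single grep invocation by writing every anchored pattern to a file and passing it with -f; a sketch (still regex-based, so it may still be slow at this scale, and using -v so the output is file A with the matches removed):

```shell
# Escape the dots, wrap each domain in an anchored subdomain pattern,
# then make one pass over file A; -v keeps only non-matching lines.
sed -e 's/\./\\./g' -e 's/.*/^([^.]+\\.)*&$/' fileB > patterns
grep -Evf patterns fileA
```

This produces the desired output directly (cbbc.co.uk, damazon.co.uk, test.damazon.co.uk for the sample data), since the anchored pattern only matches whole label sequences.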