
I'm trying to delete lines from an ad blocklist file, but only if the end of the blocklist line matches an entry in a whitelist file. A blocklist line must not be deleted if the match is only at, e.g., the start or middle of the line.

E.g.:

**Blocklist file**
randomsites.com
calendar.google.com
google.com
google.com.fake.com

**Whitelist file**
google.com

**Output to new_blocklist**
randomsites.com
google.com.fake.com

google.com.fake.com above might not be a legitimate address, but the example does demonstrate how I plan for this whitelist to work.

This line I've tried works, but it takes many minutes (on an OpenWrt router) to process a blocklist of ~300k lines:

awk 'FNR==NR{a[$0];next} {for (i in a) {if ($0 ~ i "$") next}}1' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist

This line matches exact whole lines only, but is very quick, e.g. seconds only. Could it be edited to meet the criteria (and run faster than the one above)?

awk 'NR==FNR{a[$0];next} !($0 in a)' /tmp/whitelist /tmp/blocklist > /tmp/tempfile

Thanks everyone.

Ed Morton
Wizballs
  • Regarding `This line I've tried works` - no, it doesn't, you just haven't noticed it failing yet. I say that because it's doing a regexp comparison and so all of the `.`s in the whitelist addresses will match any character in the blocklist when you need them to be treated literally instead (e.g. it'd match `google.com` with `googlexcom`). It'd also match on substrings when you need it to only match on complete strings (e.g. it'd match `google.com` with `fakegoogle.com`). You need full word string matches but you're doing partial regexp matches, see https://stackoverflow.com/q/65621325/1745001. – Ed Morton May 06 '23 at 11:58
  • I probably should have said this is the closest I've gotten so far instead of 'works'. Thanks for the feedback though! – Wizballs May 06 '23 at 20:25
  • Also, I didn't take into account the fakegoogle.com scenario, as you pointed out. – Wizballs May 06 '23 at 21:18
  • You're welcome. It'd be interesting if you could edit your question after testing to post the 3rd-run timing results for your existing script and the answers you got that you'd consider using. – Ed Morton May 07 '23 at 00:32
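
To see the two failure modes described in the first comment, here is a minimal sketch. `regex_suffix_match` is a hypothetical helper name (not from the question); it applies the same kind of `$0 ~ i "$"` test to a single whitelist entry and a single blocklist line:

```shell
# Hypothetical helper: succeeds when the whitelist entry $1, used as a
# regexp anchored with "$", matches the blocklist line $2. This mirrors
# the regexp test in the question, it is not a recommended approach.
regex_suffix_match() {
    awk -v pat="$1" -v line="$2" 'BEGIN { exit !(line ~ (pat "$")) }'
}

# The "." in the pattern matches any character, so this line is a false hit.
regex_suffix_match 'google.com' 'googlexcom'     && echo 'googlexcom matches'

# There is no label boundary check, so a longer name is also a false hit.
regex_suffix_match 'google.com' 'fakegoogle.com' && echo 'fakegoogle.com matches'
```

Both `echo` lines print, which is why neither name would survive in the output even though neither is `google.com` or a subdomain of it.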

3 Answers


Maybe instead of a lookup, you could assemble a single pattern once, joining the entries into an alternation with |, grouping the whole expression between parentheses and ending it with $.

The dot matches any character, so you have to escape it to match a literal dot.

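awk '
    # First file (whitelist): escape the dots and join the entries with "|".
    FNR == NR {
        gsub(/\./, "\\.")
        tmp = tmp sep $0
        sep = "|"
        next
    }
    # First line of the blocklist: build the anchored regexp once.
    FNR == 1 {
        regexp = "(^|[.])(" tmp ")$"
    }
    # Print only the lines that do not end in a whitelisted name.
    $0 !~ regexp
' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist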
Ed Morton
The fourth bird
  • Thank you also for taking the time to write/share this. I did try this script, and it worked as intended in my testing. Execution time ran into the minutes for a 300k-line blocklist and 10 allow entries. Another script below took approx. 10 seconds. Regardless, any and all help I'm getting here is much appreciated. – Wizballs May 07 '23 at 08:00

This might do what you want:

$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0]
    next
}
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( addr in allow ) {
            next
        }
    }
}
{ print }

$ awk -f tst.awk allow block
randomsites.com
google.com.fake.com

The above does literal-string hash lookups of each .-separated suffix of a blocklist line, starting from the right side, so it will be fast and robust. For a simple domain name in your blocklist like google.com it does only 1 lookup in the allow array, just like your !($0 in a) does. For others like google.com.fake.com it does at most one fewer lookup than there are parts in the domain (4 parts in this case, so 3 lookups), stopping as soon as it finds a match in the allow array. Even then, each iteration is just a hash lookup, so it should still be fast.
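
To make the lookup order concrete, here is a small sketch that prints the candidate suffixes the loop builds for one name, in the order they would be tested. `suffixes` is a hypothetical helper for illustration only; it reuses the loop from tst.awk but prints each candidate instead of looking it up:

```shell
# Hypothetical helper: print each suffix that the loop in tst.awk would
# test against the allow array for the name in $1, shortest first.
suffixes() {
    printf '%s\n' "$1" | awk -F. '{
        addr = $NF
        for (i = NF - 1; i >= 1; i--) {
            addr = $i FS addr
            print addr
        }
    }'
}

suffixes 'google.com.fake.com'
# prints:
#   fake.com
#   com.fake.com
#   google.com.fake.com
```

None of those three strings is `google.com`, so the line is kept, whereas `calendar.google.com` yields `google.com` as its first candidate and is dropped straight away.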

P.S. The old terminology for this was blacklist/whitelist; the current terminology is blocklist/allowlist, rather than blocklist/whitelist.

Ed Morton
  • Your method below seems to be the fastest, taking only about 10 seconds on a Netgear R7800 against a blocklist of nearly 300k entries and 10 allow entries. I did change my question a bit to include the full blocklist syntax after you mentioned the fakegoogle.com scenario. I removed all the ^local=/ prefixes to test with. In my original script I was using awk 'NR==FNR{a[$0"/"];next} to add the "/" at the end of each allow line. I'm going to have a go at editing your script to account for my changes. – Wizballs May 07 '23 at 08:00
  • Please don't modify the input/output in your question after you have answers, just ask a new question if something was wrong in your original one and you need more help. I rolled back your question so the question+answers together make sense to others in future with a similar problem to the one we answered. – Ed Morton May 07 '23 at 13:34
  • Ah OK, understood, no probs. I'll try to modify your script first before posting a new question. Thank you again. – Wizballs May 07 '23 at 19:42
  • Expanding on the original question a little: if allowlist entries stay the same, i.e. `google.com`, but blocklist entries are recorded in dnsmasq syntax, e.g. `local=/google.com/`, `local=/fakegoogle.com/`, `local=/calendar.google.com/`, `local=/google.com.fake.com/`, I was able to edit this method to: `BEGIN { FS="." } NR==FNR { allow[$0"/"]; next } { addr = $NF; for ( i=NF-1; i>=1; i-- ) { addr = $i FS addr; if ( substr(addr, 8) in allow ) { next } } } { print }` – Wizballs May 14 '23 at 03:51
  • I feel like it's getting closer, but not quite there, as now calendar.google.com is not removed from the list. I did try (lots!) to figure this out myself. Still learning awk, however... Or would I be best off just starting a brand new question with the new syntax? – Wizballs May 14 '23 at 05:50
  • Regarding [Expanding on the original question a little...](https://stackoverflow.com/questions/76188526/delete-lines-in-blocklist-file-where-the-end-of-those-lines-match-an-entry-in-a/76189090?noredirect=1#comment134456686_76189090) - please don't do that as [chameleon questions](https://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions) are strongly discouraged on this forum. If you have new requirements you want to add after you got answers to a question then simply ask a new question. – Ed Morton May 14 '23 at 11:56
  • I've made a bit of a mess of this thread. I'll go ahead and create a new question. Your advice/guidance has been very helpful. – Wizballs May 14 '23 at 18:57
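
For the dnsmasq-style lines mentioned in the comments above, one possible sketch (an assumption on my part, not something confirmed in this thread) is to strip the `local=/` prefix and the trailing `/` once per line and then run the same right-to-left suffix lookups on the bare name. The attempt with `substr(addr, 8)` removes 7 characters from every candidate suffix, not only from the complete line, which would explain why `calendar.google.com` was no longer removed. `filter_dnsmasq` is a hypothetical name, and the sketch assumes a non-empty allowlist of plain names:

```shell
# Hypothetical helper. Usage: filter_dnsmasq ALLOWFILE BLOCKFILE
# ALLOWFILE holds plain names (google.com); BLOCKFILE holds lines of the
# form local=/name/. Assumes ALLOWFILE is not empty.
filter_dnsmasq() {
    awk '
        NR == FNR { allow[$0]; next }        # first file: allowlist names
        {
            name = $0
            sub(/^local=\//, "", name)        # drop the leading local=/
            sub(/\/$/, "", name)              # drop the trailing /
            n = split(name, part, ".")
            addr = part[n]
            for (i = n - 1; i >= 1; i--) {    # same suffix walk as tst.awk
                addr = part[i] "." addr
                if (addr in allow) next
            }
            print                             # no suffix allowed: keep line
        }
    ' "$1" "$2"
}
```

It would be called as `filter_dnsmasq /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist`; the original line is printed unchanged, so the `local=/.../` syntax is preserved in the output.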

A block-less, ternaries-only awk approach that also escapes more characters than the OP requires:


mawk 'NR == FNR ? (__ = __$_ "|")<_ : $_!~(!_ < FNR \
      ? _ : substr(_, gsub("[?./_:;=&]", "[&]", __),
                       sub(".$", ")$", __)))__' __='(' \
<( printf '%s' 'google.com') <( printf '%s' 'randomsites.com
                                             calendar.google.com
                                             google.com
                                             google.com.fake.com' )

randomsites.com
google.com.fake.com
RARE Kpop Manifesto
  • Thanks so much for taking the time to write this. I looked for the mawk package in OpenWrt but unfortunately it doesn't exist, only awk and gawk. I'm trying to stick with awk for the time being so as not to add package dependencies, even though gawk does have advantages such as in-place editing etc. – Wizballs May 07 '23 at 07:56
  • @Wizballs : then just use the same code with `gawk` – RARE Kpop Manifesto May 08 '23 at 02:37