Replace array iteration with regex

Question

I want to find partially matching IPv6 prefixes in two arrays. For instance, 2001:db8: from one array should match 2001:db8:1::/48 and 2001:db8:2::/48 from the other.

I already have it working by iterating over one array inside another:

ru_routes=( $(curl -4 ftp://ftp.ripe.net/ripe/stats/delegated-ripencc-latest | egrep -o '\|RU\|ipv6\|.+?::\|[0-9]+' | cut -d'|' -f4 | sed 's/::$/:/g') );
msk_ix_routes=( $(curl -4 http://www.msk-ix.ru/download/lg/msk_ipv6_pfx.txt.gz | gunzip | egrep -o '\b.*::/[0-9]*') );
routes=();
for item1 in ${msk_ix_routes[@]}; do
    for item2 in ${ru_routes[@]}; do
        if [[ $item1 = $item2* ]]; then
            routes+=( $item1 );
            break
        fi
    done
done

But it is rather slow on my MIPS router (~90 sec). I found this useful answer, which runs much faster, but I cannot get it to work the same way as the loop above. I also don't think I need the "if" construct from that example, because it would do the same thing twice. My non-working version:

msk=" ${msk_ix_routes[*]} ";         # add framing blanks

for item in ${ru_routes[@]}; do
  routes+=( egrep -o "$item[\S]*/g" <<< $msk );
done

I guess there are problems with quoting and escaping here, but I cannot solve it. Please help) I am open to suggestions.

By the way, my first version used "comm", which runs even faster, but it only does exact matches, hence I started to play with loops:

routes=( $(comm -12 <(printf '%s\n' "${ru_routes[@]}" | LC_ALL=C sort) <(printf '%s\n' "${msk_ix_routes[@]}" | LC_ALL=C sort)) );

Unrelated to anything else, you want to quote the `[@]` list expansions to prevent word splitting of the array elements (probably not an issue in your case, but the right way to do things in general). – Etan Reisner, Aug 31 '14 at 23:48
What about those two non-working options is not working? What are they doing? (The second one looks like it will create an empty list, since the `[[` test doesn't return any contents, only a return code. You almost certainly want that test in an `if` block and then to append `$item` to the list, like in the linked question.) – Etan Reisner, Aug 31 '14 at 23:50
I agree about the second option (removed it). The first one gives me 889111 matches instead of the 4xx valid matches. `$item` would be an exact match, and I want to get all longer matches (substring). – Xand, Aug 31 '14 at 23:55

Dmitry Alexandrov · Accepted Answer · 2014-09-01T15:07:33.817

Bash scripts are not efficient at all. Try this:

#!/bin/bash

# e. g.: ripencc|RU|ipv6|2001:640::|32|19991115|allocated -> ^2001:640:
awk -v FS='|' \
    '$2 == "RU" && $3 == "ipv6" { sub(/::/, ":", $4); print "^" $4 }' \
    <(curl -4 ftp://ftp.ripe.net/ripe/stats/delegated-ripencc-latest) \
|\
# grep e. g. '^2001:640:' in '2001:640:8000::/33'
grep --basic-regexp --file - \
    <(curl -4 http://www.msk-ix.ru/download/lg/msk_ipv6_pfx.txt.gz | gunzip)
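The same idea can be tried offline. The following is only a minimal sketch with made-up sample data standing in for the two downloads (the variable names `delegated`, `announced`, `patterns` and `matches` are hypothetical). It passes the patterns as a newline-separated list via `grep -e` instead of `--file -`, so it needs neither process substitution nor temporary files:

```shell
#!/bin/sh
# Hypothetical sample data; the real script reads these from curl.
delegated='ripencc|RU|ipv6|2001:640::|32|19991115|allocated
ripencc|DE|ipv6|2001:650::|32|19991115|allocated
ripencc|RU|ipv4|5.8.0.0|2048|20120503|allocated'

announced='2001:640:8000::/33
2001:650:1::/48
2001:6400::/32'

# Step 1: turn every RU IPv6 record into an anchored pattern,
# e.g. 2001:640:: becomes ^2001:640:
patterns=$(printf '%s\n' "$delegated" |
    awk -F'|' '$2 == "RU" && $3 == "ipv6" { sub(/::/, ":", $4); print "^" $4 }')

# Step 2: keep only the announced prefixes that start with one of the patterns.
# Caveat: if $patterns were empty, the empty pattern would match every line.
matches=$(printf '%s\n' "$announced" | grep -e "$patterns")

printf '%s\n' "$matches"
```

With this sample input only 2001:640:8000::/33 is printed. The trailing colon kept in the pattern is what stops `^2001:640:` from also matching 2001:6400::/32.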

edited Sep 01 '14 at 15:07

answered Sep 01 '14 at 00:06


Thanks comrade Dmitry, your last edit nailed it. And it is way faster than loops. Is there any way to also remove duplicate records coming from msk_ipv6_pfx.txt? For example, 2a02:bc8::/32 and 2a02:bc8:fffe::/48. Both routes will go to the IX, but the less specific prefix is enough (2a02:bc8::/32 is delegated). Thank you again – Xand Sep 01 '14 at 00:31
@Xand You’re welcome. :-) As for removing duplicate entries in `msk_ipv6_pfx.txt`, it looks non-trivial enough to be a separate question here on SO. But yes, of course, it’s possible. For instance: `$ tac msk_ipv6_pfx.txt | awk -F '::' -v P='^$' '!/^[# ]/ && $0 != "" && $1 !~ P { P = "^" $1; print }'`. That is not an optimal way, though – it can be accomplished without reversing the file. Do you need any comments? – Dmitry Alexandrov Sep 01 '14 at 01:26
I do see that the number of lines is reduced, but both prefixes mentioned above remain (I would need to sleep on it anyway). But I would really appreciate hints on redirects to feed grep with the msk-ix file after curl and gunzip. Otherwise I guess I have to define it as a separate variable. Thank you – Xand Sep 01 '14 at 02:01
@Xand As for `curl`, if temporary files do not suit you – see the edited answer. – Dmitry Alexandrov Sep 01 '14 at 14:35
@Xand As for removing the redundancy in `msk_ipv6_pfx.txt`, well, I wasn’t attentive – actually `tac` is not enough, we have to re-sort the file: `$ sort -n msk_ipv6_pfx.txt | awk -F '::' -v P='^$' '!/^[# ]/ && $0 != "" && $1 !~ P { P = "^" $1; print }'`. – Dmitry Alexandrov Sep 01 '14 at 14:41
It would look cleaner with temp files, but that would increase flash wear on the router. It has busybox with a built-in grep, so I had to remove "--basic-regexp". Working fine. So in the end I got a one-liner which does the job in 17 sec. Dmitry, you showed me something new and really saved me a week of googling. Thank you very much) – Xand Sep 01 '14 at 17:47
I checked the output more carefully and noticed that the awk part does not work as expected, regardless of which sorting I choose. Given `2001:14e8:1::/48 2001:14e8:2::/48 2001:14e8::/32 2001:14e8::/48`, the expected result is `2001:14e8::/32`, but it leaves only `2001:14e8::/48`. Any guidelines? I don't really understand the `!/^[# ]/` part. Thanks – Xand Oct 24 '14 at 14:09
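
On the `!/^[# ]/` part: in awk this condition is true only for lines that do not begin with `#` or a space, so it skips comment and indented lines. For the redundancy itself, one possible approach (a sketch, not the answerer's method) is to sort by prefix length so that the shortest prefix is seen first, then drop every later prefix whose text starts with an already kept one. The sample data and the names `prefixes`, `deduped`, `kept` and `stem` are hypothetical. The comparison is purely textual, so as a conservative assumption a prefix is only used as a cover when its length is exactly 16 bits per written group (for example 2001:14e8::/32); anything else, such as a /33, is printed but never hides other entries.

```shell
#!/bin/sh
# Hypothetical sample data taken from the prefixes mentioned in the comments.
prefixes='2001:14e8:1::/48
2001:14e8:2::/48
2001:14e8::/32
2001:14e8::/48
2a02:bc8:fffe::/48
2a02:bc8::/32'

# 1st awk: skip comment/indented lines, emit "length address".
# sort:    shortest prefix length first (LC_ALL=C keeps tie order stable).
# 2nd awk: print a prefix unless a kept, group-aligned prefix already covers it.
deduped=$(printf '%s\n' "$prefixes" |
    awk -F'/' '!/^[# ]/ && NF == 2 { print $2, $1 }' |
    LC_ALL=C sort -n -k1,1 |
    awk '{
        len = $1 + 0; addr = $2
        stem = addr; sub(/::$/, "", stem)      # 2001:14e8:: -> 2001:14e8
        covered = 0
        for (s in kept)
            # Compare with a trailing colon so 2001:14e8 cannot cover 2001:14e80
            if (index(stem ":", s ":") == 1) { covered = 1; break }
        if (!covered) {
            print addr "/" len
            # Only group-aligned prefixes are safe to use as textual covers
            if (split(stem, groups, ":") * 16 == len) kept[stem] = 1
        }
    }')

printf '%s\n' "$deduped"
```

For the sample above this leaves 2001:14e8::/32 and 2a02:bc8::/32. Addresses written in a different but equivalent form (for example with explicit zero groups) would not be recognised as covered, so the result may keep more entries than strictly necessary.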
