
I have a string like this one below (nvram extract) that is used by tinc VPN to define the network hosts:

1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>

I'm trying to extract from the above:

host1.network.org host2.network.org host3.network.org

The hostnames and keys are made up, but the structure of the input string is accurate. The end node could just as well be defined as an IP address, so I'm trying to extract whatever sits between the second occurrence of "<" and the first occurrence of "<<". Since there are multiple matches, the occurrences are counted from either the beginning of the line or a ">" character. So the above can be read as follows:

1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>

1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>

0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>

As I need this info in a shell script, I guess I would need to store each host/IP as an element of an array.

I have used regexp online editors, and managed to work out this string:

^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)

However, if I run

grep -Eo '^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)'

against the initial string, I get the full string back, so I must be doing something wrong :-/

P.S. I'm running on BusyBox: `BusyBox v1.25.1 (2017-05-21 14:11:58 CEST) multi-call binary.

Usage: grep [-HhnlLoqvsriwFE] [-m N] [-A/B/C N] PATTERN/-e PATTERN.../-f FILE [FILE]...

Search for PATTERN in FILEs (or stdin)

    -H      Add 'filename:' prefix
    -h      Do not add 'filename:' prefix
    -n      Add 'line_no:' prefix
    -l      Show only names of files that match
    -L      Show only names of files that don't match
    -c      Show only count of matching lines
    -o      Show only the matching part of line
    -q      Quiet. Return 0 if PATTERN is found, 1 otherwise
    -v      Select non-matching lines
    -s      Suppress open and read errors
    -r      Recurse
    -i      Ignore case
    -w      Match whole words only
    -x      Match whole lines only
    -F      PATTERN is a literal (not regexp)
    -E      PATTERN is an extended regexp
    -m N    Match up to N times per file
    -A N    Print N lines of trailing context
    -B N    Print N lines of leading context
    -C N    Same as '-A N -B N'
    -e PTRN Pattern to match
    -f FILE Read pattern from file`

Thanks!

rs232
  • Could extracting all host names like [this example at regex101](https://regex101.com/r/71ErS1/1) do it for you? – SamWhan Jun 12 '17 at 16:39
  • This is another example that works online but not with grep in BusyBox: grep -Eo '\w*[a-z]\w*(?:\.\w*[a-z]\w*)+' grep: bad regex '\w*[a-z]\w*(?:\.\w*[a-z]\w*)+': Invalid preceding regular expression – rs232 Jun 13 '17 at 07:35

2 Answers


The regex you have is based on capturing groups and with grep you can only get full matches. Besides, you use -E (POSIX ERE flavor), while your regex is actually not POSIX ERE compatible as it contains lazy quantifiers that are not supported by this flavor.

I think you can extract all non-< chars between < and << followed by a digit and then a < with a PCRE regex (-P option):

s='1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>'
echo "$s" | grep -oP '(?<=<)[^<]+(?=<<[0-9]<)'

See the regex demo and a grep demo.

Output:

host1.network.org
host2.network.org
host3.network.org

Here, (?<=<) is a positive lookbehind that only checks for the < presence immediately to the left of the current location but does not add < to the match value, [^<]+ matches 1+ chars other than < and (?=<<[0-9]<) (a positive lookahead) requires <<, then a digit, and then a < but again does not add these chars to the match.
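The same idea can be approximated without lookarounds, which may help when grep only offers -E. The sketch below lets the match include the delimiters and strips them afterwards with sed. It assumes the endpoint never contains "<", and `hosts_ere` is a hypothetical helper name chosen for illustration.

```sh
#!/bin/sh
# Sample input copied from the question.
s='1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>'

# hosts_ere: hypothetical helper, not part of grep or tinc.
hosts_ere() {
    # Step 1: grep -oE prints every "<endpoint<<digit<" match,
    #         delimiters included (ERE has no lookarounds).
    # Step 2: sed drops the leading "<" and the trailing "<<digit<".
    printf '%s\n' "$1" |
        grep -oE '<[^<]+<<[0-9]<' |
        sed -e 's/^<//' -e 's/<<[0-9]<$//'
}

hosts_ere "$s"
```

The negated class `[^<]+` plays the role the lazy `.*?` was meant to play in the original attempt: it cannot run past the next "<", so no laziness is needed.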

If you have no PCRE option in grep, try replacing all the text you do not need with some char, and then either split with awk, or use grep:

echo "$s" |
    sed 's/[^<]*<[^<]*<\([^<][^<]*\)<<[0-9]<[^<]*<<[^<]*[<>]*/|\1/g' |
    grep -oE '[^|]+'

See another online demo.
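For the awk route mentioned above, a minimal sketch could treat ">" as the record separator and "<" as the field separator, so the endpoint is simply field 3 of each record. It assumes that no field contains a literal ">" or "<", and that the script runs under a shell without arrays (BusyBox ash, as far as I know), so the results go into the positional parameters instead. `hosts_awk` is a hypothetical helper name.

```sh
#!/bin/sh
# Sample input copied from the question.
s='1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>'

# hosts_awk: hypothetical helper, not part of awk or tinc.
hosts_awk() {
    # RS=">" turns each host entry into one record; FS="<" splits it
    # into fields, so field 3 is the endpoint (hostname or IP).
    # NF >= 3 skips any short leftover record after the last ">".
    printf '%s' "$1" |
        awk 'BEGIN { RS = ">"; FS = "<" } NF >= 3 { print $3 }'
}

# Load the endpoints into the positional parameters ($1, $2, ...).
# Unquoted on purpose: relies on word splitting, so this assumes the
# endpoints contain no whitespace or glob characters.
set -- $(hosts_awk "$s")

echo "found $# hosts"
for h in "$@"; do
    echo "endpoint: $h"
done
```

In a shell that does support arrays, the `set --` line could be swapped for an array assignment; the awk part stays the same.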

Wiktor Stribiżew
  • Great explanation thanks! What I forget to say is: I'm running this on busybox e.g. grep doesn't have any "P" option – rs232 Jun 12 '17 at 17:02

OK, no response to my comment so I'll enter it as answer. How about

\w*[a-z]\w*(\.\w*[a-z]\w*)+

It matches at least two parts of a fully qualified name, separated by a dot.

grep -Eo '\w*[a-z]\w*(\.\w*[a-z]\w*)+'

yields

host1.network.org
host2.network.org
host3.network.org

(assuming your string is entered in stdin ;)

SamWhan