3

I have a big email file that contains a list of hosts like this:

......
HOSTS: test-host,host2.domain.com,
host3.domain.com,another-testing-host,host.domain.
com,host.anotherdomain.net,host2.anotherdomain.net,
another-local-host, TEST-HOST

DATE: August 11 2015 9:00
.......

The hosts are always delimited with a comma, but the list can be wrapped across one, two, or more lines (I can't control this; it's what email clients do, unfortunately).

So I need to extract all the text between the string "HOSTS:" and the string "DATE:", join it back into a single line, and replace the commas with newlines, like this:

test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST

So far I've come up with this, but I lose everything that's on the same line as "HOSTS:":

sed '/HOST/,/DATE/!d;//d' ${file} | tr -d '\n' | sed -E "s/,\s*/\n/g"
Tony
  • Your bug is that `//` does not match only blank lines as you are (I think) presuming. Use `/^$/d` or `/./!d` instead. You end up with more text than you want, but I assume you can take it from there... – Jeff Y Jun 24 '16 at 14:14
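For what it's worth, in sed an empty regex `//` reuses the last regex that was applied, so in the question's command `//d` deletes the HOSTS line and the DATE line themselves, which is why the first hosts disappear. Here is a minimal sketch of the original pipeline with that one step changed; the function name `extract_hosts` is made up for illustration, and it assumes the list starts on the "HOSTS:" line and that host names contain no spaces:

```shell
# Hypothetical helper: print one host per line from the HOSTS:..DATE: block.
extract_hosts() {
    # Keep only the HOSTS:..DATE: range, drop the DATE: line explicitly,
    # and strip the "HOSTS:" prefix instead of deleting its whole line.
    sed '/HOSTS:/,/DATE:/!d; /DATE:/d; s/^HOSTS: *//' "$1" |
        tr -d ' \n' |   # join the wrapped lines and drop stray spaces
        tr ',' '\n'     # one host per line
    echo                # tr leaves no trailing newline, so add one
}
```

Call it as `extract_hosts "$file"`. It only uses POSIX features of sed and tr, so it should not depend on GNU extensions such as `\s`.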

7 Answers

7

Something like this might work for you:

sed -n '/HOSTS:/{:a;N;/DATE/!ba;s/[[:space:]]//g;s/,/\n/g;s/.*HOSTS:\|DATE.*//g;p}' "$file"

Breakdown:

-n                       # Disable printing
/HOSTS:/ {               # Match line containing literal HOSTS:
  :a;                    # Label used for branching (goto)
  N;                     # Append the next line to the pattern space
  /DATE/!ba              # Until the pattern space contains DATE, go back to :a
  s/[[:space:]]//g;      # Delete all whitespace, including the embedded newlines
  s/,/\n/g;              # Replace each comma with a newline (\n here needs GNU sed)
  s/.*HOSTS:\|DATE.*//g; # Remove everything up to and including literal HOSTS:
                         # and everything from literal DATE onward (\| needs GNU sed)
  p                      # Print the pattern space
}
Andreas Louv
  • I'm marking this as the correct answer. I tested it, it works and I also like the explanation, but I will go with my own method, which now works thanks to Jeff Y's suggestion. – Tony Jun 24 '16 at 16:54
3

another awk with tr

$ awk '/^HOSTS:/{$1="";p=1} /^DATE:/{p=0} p' file | tr -d ' \n' | tr ',' '\n'; echo ""

test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST
karakfa
2

this awk one-liner may help:

awk -v RS='HOSTS: *|DATE:' 'NR==2{gsub(/\n/,"");gsub(/ *, */,"\n");print}' input
Kent
2

cat ${file} | awk 'BEGIN {A=0;} /^HOST/ {A=1;} /^DATE/ {A=0} {if (A==1) print;}' | tr -d '\n' | sed -E "s/,\s*/\n/g" | sed -e 's/^HOSTS:\s*//'
Amit
  • UUOC. Always quote your shell variables. Don't use all upper case var names. You never need sed, etc. with awk. `{if (A==1) print;}` can be written as simply `A`. You don't need the spurious semi-colons in awk. Always use single quotes around scripts (e.g. sed), not double. `\s` is GNU sed specific so you should state that. – Ed Morton Jun 24 '16 at 15:57
2

Here is another sed script, that might work for you:

script.sed

/HOSTS:/,/DATE/ { 
    /DATE/! H;                        # append to HOLD space
    /DATE/ { g;                       # copy HOLD space into PATTERN space
             s/([\n ])|(HOSTS:)//g;   # remove unwanted strings
             s/,/\n/g;                # replace comma with newline
             p;                       # print
    }
}

Use it this way: sed -nrf script.sed yourfile.

The middle block is applied to the lines in the range between HOSTS: and DATE. Inside that block, lines that do not match DATE are appended to the hold space, and the line matching DATE triggers the longer action.
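To see the hold-space mechanics in isolation, here is a small sketch that is separate from the script above: every line is appended to the hold space with `H`, and on the last line `g` copies the hold space back so the embedded newlines can be removed. The function name `join_lines` is just an illustrative choice.

```shell
# Hypothetical helper: join all input lines into a single line.
join_lines() {
    # H appends a newline plus the current line to the hold space.
    # On the last line ($), g overwrites the pattern space with the hold
    # space, s/\n//g deletes the embedded newlines, and p prints the result.
    sed -n 'H; $ { g; s/\n//g; p; }'
}

printf 'a\nb\nc\n' | join_lines   # prints: abc
```

Note that `H` always puts a newline in front of what it appends, so the collected text starts with a newline; the `s/\n//g` takes care of that one too.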

Lars Fischer
1

Perl to the rescue!

perl -ne '
    if (my $l = (/^HOSTS:/ .. /^DATE:/)) {
        chomp;
        s/^HOSTS:\s+// if 1 == $l;
        s/DATE:.*// if $l =~ /E/;
        s/,\s*/\n/g;
        print;
    }' input-file > output-file

The flip-flop operator .. returns a number, in this case indicating the line number in the current block. We can therefore easily remove the HOSTS: from the first line (1 == $l). The last line can be recognised by the E0 appended to the number, that's how we remove the DATE:...

choroba
1

awk 'sub(/^HOSTS: /,""){rec=""} /^DATE/{gsub(/ *, */,"\n",rec); print rec; exit} {rec = rec $0}' file
test-host
host2.domain.com
host3.domain.com
another-testing-host
host.domain.com
host.anotherdomain.net
host2.anotherdomain.net
another-local-host
TEST-HOST
Ed Morton